Abstract

We study prediction with expert advice in the setting where the losses
are accumulated with some discounting and the impact of old losses can
gradually vanish. We generalize the Aggregating Algorithm and the
Aggregating Algorithm for Regression, propose a new variant of the
exponentially weighted average algorithm, and prove bounds on the
cumulative discounted loss.

Prediction with expert advice is a framework for online sequence prediction.
Predictions are made step by step. The quality of each prediction (the
discrepancy between the prediction and the actual outcome) is evaluated by a
real number called loss. The losses are accumulated over time. In the standard
framework for prediction with expert advice (see the monograph [2] for a
comprehensive review), the losses from all steps are just summed. In this
paper, we consider a generalization where older losses can be devalued; in
other words, we use discounted cumulative loss.

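Predictions are made by Experts and Learner according to Protocol 1.

Protocol 1 Prediction with expert advice under general discounting
  $\mathcal{L}_0 := 0$.
  $\mathcal{L}_0^\theta := 0$, $\theta \in \Theta$.
  for $t = 1, 2, \dots$ do
    Accountant announces $\alpha_{t-1} \in (0, 1]$.
    Experts announce $\gamma_t^\theta \in \Gamma$, $\theta \in \Theta$.
    Learner announces $\gamma_t \in \Gamma$.
    Reality announces $\omega_t \in \Omega$.
    $\mathcal{L}_t^\theta := \alpha_{t-1}\mathcal{L}_{t-1}^\theta + \lambda(\gamma_t^\theta, \omega_t)$, $\theta \in \Theta$.
    $\mathcal{L}_t := \alpha_{t-1}\mathcal{L}_{t-1} + \lambda(\gamma_t, \omega_t)$.
  end for

In this protocol, $\Omega$ is the set of possible outcomes and
$\omega_1, \omega_2, \omega_3, \dots$ is the sequence to predict; $\Gamma$ is
the set of admissible predictions, and
$\lambda : \Gamma \times \Omega \to [0, \infty]$ is the loss function. The
triple $(\Omega, \Gamma, \lambda)$ specifies the game of prediction. The most
common examples are the binary square loss, log loss, and absolute loss games.
They have $\Omega = \{0, 1\}$ and $\Gamma = [0, 1]$, and their loss functions
are $\lambda^{\mathrm{sq}}(\gamma, \omega) = (\gamma - \omega)^2$,
$\lambda^{\log}(\gamma, 0) = -\log(1 - \gamma)$ and
$\lambda^{\log}(\gamma, 1) = -\log\gamma$,
$\lambda^{\mathrm{abs}}(\gamma, \omega) = |\gamma - \omega|$,
respectively.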
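
The bookkeeping of Protocol 1 can be sketched in a few lines of Python. This is
only an illustration under assumptions: the function name `run_protocol`, the
list-based interface, and the default square loss are hypothetical choices, and
Learner's predictions are supplied from outside rather than computed by any of
the strategies discussed in this paper.

```python
def square_loss(gamma, omega):
    """Binary square loss: (gamma - omega)^2."""
    return (gamma - omega) ** 2


def run_protocol(discounts, expert_preds, learner_preds, outcomes, loss=square_loss):
    """Accumulate discounted losses as in Protocol 1.

    discounts[t-1] holds alpha_{t-1}; expert_preds[t-1] is the list of the
    Experts' predictions at step t (hypothetical interface).
    Returns (L_T, [L_T^theta for each expert]).
    """
    n_experts = len(expert_preds[0])
    learner_total = 0.0
    expert_totals = [0.0] * n_experts
    for alpha, preds, gamma, omega in zip(discounts, expert_preds, learner_preds, outcomes):
        # L_t^theta := alpha_{t-1} L_{t-1}^theta + lambda(gamma_t^theta, omega_t)
        expert_totals = [alpha * total + loss(p, omega)
                         for total, p in zip(expert_totals, preds)]
        # L_t := alpha_{t-1} L_{t-1} + lambda(gamma_t, omega_t)
        learner_total = alpha * learner_total + loss(gamma, omega)
    return learner_total, expert_totals
```

For instance, with two constant Experts (always 1 and always 0), Learner always
predicting 0.5, discounts (1, 0.5, 0.5) and outcomes (1, 0, 1), old losses are
halved at steps 2 and 3 before the new loss is added.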

The players in the game of prediction are Experts $\theta$ from some pool
$\Theta$, Learner, and also Accountant and Reality. We are interested in
(worst-case optimal) strategies for Learner, and thus the game can be regarded
as a two-player game, where Learner opposes the other players. The aim of
Learner is to keep his total loss $\mathcal{L}_t$ small as compared to the
total losses $\mathcal{L}_t^\theta$ of all experts $\theta \in \Theta$.

The standard protocol of prediction with expert advice (as described in
[19, 20]) is a special case of Protocol 1 where Accountant always announces
$\alpha_t = 1$, $t = 0, 1, 2, \dots$. The new setting gives some more freedom
to Learner's opponents.

Another important special case is the exponential (geometric) discounting
$\alpha_t = \alpha \in (0, 1)$. Exponential discounting is widely used in
finance and economics (see, e.g., [H]), time series analysis (see, e.g., [8]),
reinforcement learning [18], and other applications. In the context of
prediction with expert advice, Freund and Hsu [6] noted that the discounted
loss provides an alternative to the "tracking the best expert" framework [11].
Indeed, an exponentially discounted sum depends almost exclusively on the last
$O(\log(1/\alpha))$ terms. If the expert with the best one-step performance
changes at this rate, then Learner observing the $\alpha$-discounted losses
will mostly follow predictions of the current best expert. Under our more
general discounting, more subtle properties of best expert changes may be
specified by varying the discount factor. In particular, one can cause Learner
to "restart mildly" by giving $\alpha_t = 1$ (or $\alpha_t \approx 1$) most of
the time and $\alpha_t \ll 1$ at crucial moments. (We prohibit $\alpha_t = 0$
in the protocol, since this is exactly the same as stopping the current game
and starting a new, independent game; on the other hand, the assumption
$\alpha_t \ne 0$ simplifies some statements.)

Cesa-Bianchi and Lugosi [2, § 2.11] discuss another kind of discounting

$$ L_T = \sum_{t=1}^{T} \beta_{T-t}\,\ell_t , \tag{1} $$

where $\ell_t$ are one-step losses and $\beta_t$ are some decreasing discount
factors. To see the difference, let us rewrite our definition in the same
style:

$$ L_T = \alpha_{T-1}L_{T-1} + \ell_T
       = \alpha_{T-1}\alpha_{T-2}L_{T-2} + \alpha_{T-1}\ell_{T-1} + \ell_T
       = \dots
       = \sum_{t=1}^{T} \alpha_t \cdots \alpha_{T-1}\,\ell_t
       = \frac{1}{\beta_T}\sum_{t=1}^{T} \beta_t \ell_t , \tag{2} $$

where $\beta_t = 1/(\alpha_1 \cdots \alpha_{t-1})$, $\beta_1 = 1$. The sequence
$\beta_t$ is non-decreasing, $\beta_1 \le \beta_2 \le \beta_3 \le \dots$; but
it is applied "in the reverse order" compared to (1). So, in both definitions,
the older the losses are, the less weight they are ascribed. However, according
to (1), the losses $\ell_t$ have different relative weights in $L_T$,
$L_{T+1}$ and so on, whereas (2) fixes the relative weight of $\ell_t$ with
respect to all previous losses forever starting from the moment $t$. The
latter property allows us to get uniform algorithms for Learner with loss
guarantees that hold for all $T = 1, 2, \dots$; in contrast, Theorem 2.8 in
[2] gives a guarantee only at one moment $T$ chosen in advance. The only kind
of discounting that can be expressed both as (1) and as (2) is the exponential
discounting $\sum_{t=1}^{T} \alpha^{T-t}\ell_t$. Under this discounting, the
NormalHedge algorithm is analysed in [6]; we briefly compare the obtained
bounds in Section 3.
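
The identity (2) between the recursive definition of the discounted loss and
its closed form in terms of $\beta_t$ is easy to check numerically. The sketch
below is only an illustration; the function names and the convention that
`alphas[t-1]` stores $\alpha_{t-1}$ are assumptions made for the example.

```python
import random


def discounted_recursive(alphas, losses):
    """L_T computed by L_t = alpha_{t-1} L_{t-1} + l_t (alphas[t-1] is alpha_{t-1})."""
    total = 0.0
    for alpha, l in zip(alphas, losses):
        total = alpha * total + l
    return total


def discounted_closed_form(alphas, losses):
    """L_T = (1/beta_T) * sum_t beta_t l_t with beta_1 = 1, beta_t = beta_{t-1}/alpha_{t-1}."""
    betas = [1.0]
    for alpha in alphas[1:]:          # alpha_1, ..., alpha_{T-1}
        betas.append(betas[-1] / alpha)
    return sum(b * l for b, l in zip(betas, losses)) / betas[-1]
```

Note that $\alpha_0$ (the first entry of `alphas`) never influences the result,
because it multiplies $L_0 = 0$.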

Let us say a few words about the "economical" interpretation of discounting.
Recall that $\alpha_t \le 1$ in Protocol 1; in other words, the previous
cumulative loss cannot become more important at later steps. If the losses are
interpreted as lost money, it is more natural to assume that the old losses
must be multiplied by something greater than 1. Indeed, the money could have
been invested and have brought some interest, so the current value of an
ancient small loss can be considerable. Nevertheless, there is a not so
artificial interpretation for our discounting model as well. Assume that the
loss at each step is expressed as a quantity of some goods, and we pay for
them in cash; say, we pay for apples damaged because of our incorrect weather
prediction. The price of apples can increase but never decreases. Then
$\beta_t$ in (2) is the current price, $\sum_{t=1}^{T} \beta_t \ell_t$ is the
total sum of money we lost, and $L_T$ is the quantity of apples that we could
have bought now if we had not lost so much money. (We must also assume that we
cannot hedge our risk by buying a lot of cheap apples in advance, since the
apples will rot, and that the bank interest is zero.)

We need the condition $\alpha_t \le 1$ for our algorithms and loss bounds.
However, the case of $\alpha_t \ge 1$ is no less interesting. We cannot say
anything about it and leave it as an open problem, as well as the general case
of arbitrary positive $\alpha_t$.

The rest of the paper is organized as follows. In Section 2, we propose a
generalization of the Aggregating Algorithm [20] and prove the same bound as
in [20] but for the discounted loss. In Section 3, we consider convex loss
functions and propose an algorithm similar to the Weak Aggregating Algorithm
[13] and the exponentially weighted average forecaster with time-varying
learning rate [2, § 2.3], with a similar loss bound. In Section 4, we consider
the use of prediction with expert advice for the regression problem and adapt
the Aggregating Algorithm for Regression [22] (applied to spaces of linear
functions and to reproducing kernel Hilbert spaces) to the discounted square
loss. All our algorithms are inspired by the methodology of defensive
forecasting [4]. We do not explicitly use or refer to this technique in the
main text. However, to illustrate these ideas we provide an alternative
treatment of the regression task with the help of defensive forecasting in
Appendix A.2.

2 Linear Bounds for Learner's Loss 

In this section, we assume that the set of experts is finite,
$\Theta = \{1, \dots, K\}$, and show how Learner can achieve a bound of the
form $\mathcal{L}_T \le c\mathcal{L}_T^k + (c\ln K)/\eta$ for all Experts $k$,
where $c \ge 1$ and $\eta > 0$ are constants. Bounds of this kind were
obtained in [12]. Loosely speaking, such a bound holds for certain $c$ and
$\eta$ if and only if the game $(\Omega, \Gamma, \lambda)$ has the following
property:

$$ \exists \gamma \in \Gamma \;\; \forall \omega \in \Omega \quad
   \lambda(\gamma, \omega) \le -\frac{c}{\eta}
   \ln\Bigl( \sum_{i \in I} p_i e^{-\eta\lambda(\gamma_i, \omega)} \Bigr) \tag{3} $$

for any finite index set $I$, for any $\gamma_i \in \Gamma$, $i \in I$, and
for any $p_i \in [0, 1]$ such that $\sum_{i \in I} p_i = 1$. It turns out that
this property is sufficient for the discounted case as well.

Theorem 1. Suppose that the game $(\Omega, \Gamma, \lambda)$ satisfies
condition (3) for certain $c \ge 1$ and $\eta > 0$. In the game played
according to Protocol 1, Learner has a strategy guaranteeing that, for any $T$
and for any $k \in \{1, \dots, K\}$, it holds

$$ \mathcal{L}_T \le c\mathcal{L}_T^k + \frac{c\ln K}{\eta} . \tag{4} $$

We formulate the strategy for Learner in Subsection 2.1 and prove the theorem
in Subsection 2.2.

For the standard undiscounted case (Accountant announces $\alpha_t = 1$ at
each step $t$), this theorem was proved by Vovk in [19] with the help of the
Aggregating Algorithm (AA) as Learner's strategy. It is known ([19, 20]) that
this bound is asymptotically optimal for large pools of Experts (for games
satisfying some assumptions): if the game does not satisfy (3) for some
$c \ge 1$ and $\eta > 0$, then, for sufficiently large $K$, there is a
strategy for Experts and Reality (recall that Accountant always says
$\alpha_t = 1$) such that Learner cannot secure (4). For the special case of
$c = 1$, bound (4) is tight for any fixed $K$ as well [21]. These results
imply optimality of Theorem 1 in the new setting with general discounting
(when we allow arbitrary behaviour of Accountant with the only requirement
$\alpha_t \in (0, 1]$). However, they leave open the question of lower bounds
under different discounting assumptions (that is, when Accountant's moves are
fixed); a particularly interesting case is the exponential discounting
$\alpha_t = \alpha \in (0, 1)$.

2.1 Learner's Strategy 

To prove Theorem 1, we will exploit the AA with a minor modification.

Algorithm 1 The Aggregating Algorithm
 1: Initialize weights of Experts $w_0^k := 1/K$, $k = 1, \dots, K$.
 2: for $t = 1, 2, \dots$ do
 3:   Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \dots, K$.
 4:   Calculate $g_t(\omega) = -\frac{c}{\eta}\ln\bigl(\sum_{k=1}^{K} w_{t-1}^k e^{-\eta\lambda(\gamma_t^k, \omega)}\bigr)$, for all $\omega \in \Omega$.
 5:   Output $\gamma_t := \sigma(g_t) \in \Gamma$.
 6:   Get $\omega_t \in \Omega$.
 7:   Update the weights $\tilde{w}_t^k := w_{t-1}^k e^{-\eta\lambda(\gamma_t^k, \omega_t)}$, $k = 1, \dots, K$,
 8:   and normalize them $w_t^k := \tilde{w}_t^k / \sum_{k=1}^{K} \tilde{w}_t^k$, $k = 1, \dots, K$.
 9: end for.

The pseudocode of the AA is given as Algorithm 1. The algorithm has three
parameters, which depend on the game $(\Omega, \Gamma, \lambda)$: $c \ge 1$,
$\eta > 0$, and a function $\sigma : \mathbb{R}^\Omega \to \Gamma$. The
function $\sigma$ is called a substitution function and must have the
following property: $\lambda(\sigma(g), \omega) \le g(\omega)$ for all
$\omega \in \Omega$ if for $g \in \mathbb{R}^\Omega$ there exists any
$\gamma \in \Gamma$ such that $\lambda(\gamma, \omega) \le g(\omega)$ for all
$\omega \in \Omega$. A natural example of a substitution function is given by

$$ \sigma(g) = \arg\min_{\gamma \in \Gamma} \sup_{\omega \in \Omega}
   \bigl(\lambda(\gamma, \omega) - g(\omega)\bigr) \tag{5} $$

(if the minimum is attained at several points, one can take any of them). An
advantage of this $\sigma$ is that the normalization step in line 8 is not
necessary and one can take $w_t^k = \tilde{w}_t^k$. Indeed, multiplying all
$w_t^k$ by a constant (independent of $k$) we add to all $g_t(\omega)$ a
constant (independent of $\omega$), and $\sigma(g_t)$ does not change.

The Aggregating Algorithm with Discounting (AAD) differs only by the use of
the weights in the computation of $g_t$ and the update of the weights.

The pseudocode of the AAD is given as Algorithm 2.

Algorithm 2 The Aggregating Algorithm with Discounting
 1: Initialize weights of Experts $w_0^k := 1$, $k = 1, \dots, K$.
 2: for $t = 1, 2, \dots$ do
 3:   Get discount $\alpha_{t-1} \in (0, 1]$.
 4:   Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \dots, K$.
 5:   Calculate $g_t(\omega) = -\frac{c}{\eta}\ln\bigl(\sum_{k=1}^{K} \frac{1}{K}(w_{t-1}^k)^{\alpha_{t-1}} e^{-\eta\lambda(\gamma_t^k, \omega)}\bigr)$, for all $\omega \in \Omega$.
 6:   Output $\gamma_t := \sigma(g_t) \in \Gamma$.
 7:   Get $\omega_t \in \Omega$.
 8:   Update the weights $w_t^k := (w_{t-1}^k)^{\alpha_{t-1}} e^{\eta\lambda(\gamma_t, \omega_t)/c - \eta\lambda(\gamma_t^k, \omega_t)}$, $k = 1, \dots, K$.
 9: end for.

For a substitution function satisfying (5), one can use in line 8 the update
rule $w_t^k := (w_{t-1}^k)^{\alpha_{t-1}} e^{-\eta\lambda(\gamma_t^k, \omega_t)}$,
which does not contain Learner's losses, in the same manner as the
normalization in Algorithm 1 can be omitted.
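
To make the AAD concrete, here is a minimal Python sketch for the binary
square-loss game ($\Omega = \{0, 1\}$, $\Gamma = [0, 1]$). Several choices are
assumptions of this example rather than part of the algorithm's statement: we
take $c = 1$ and $\eta = 2$ (the value $2/(Y_2 - Y_1)^2$ used for the square
loss in Section 4.1 with $[Y_1, Y_2] = [0, 1]$), we use the closed form
$\gamma = \tfrac12 + (g(0) - g(1))/2$ clipped to $[0, 1]$, which is what the
minimax rule (5) reduces to for this particular game, and we use the
simplified weight update without Learner's loss. Weights are kept in
logarithmic form for numerical stability; the name `aad_square_loss` is
hypothetical.

```python
import math
import random

ETA = 2.0  # assumed: binary square loss satisfies (3) with c = 1, eta = 2


def sq(gamma, omega):
    return (gamma - omega) ** 2


def aad_square_loss(discounts, expert_preds, outcomes):
    """Run the AAD (c = 1); return a list of (L_t, [L_t^k]) after each step."""
    K = len(expert_preds[0])
    log_w = [0.0] * K                      # ln w_0^k = 0
    L, Lk = 0.0, [0.0] * K
    trace = []
    for alpha, preds, omega in zip(discounts, expert_preds, outcomes):
        powered = [alpha * lw for lw in log_w]   # ln (w_{t-1}^k)^{alpha_{t-1}}

        def g(o):
            # g_t(o) = -(1/eta) ln sum_k (1/K) (w_{t-1}^k)^alpha e^{-eta*loss}
            terms = [p - ETA * sq(gk, o) for p, gk in zip(powered, preds)]
            m = max(terms)                 # log-sum-exp shift
            return -(m + math.log(sum(math.exp(x - m) for x in terms) / K)) / ETA

        # substitution function (5) for this game: equalize the two excesses
        gamma = min(1.0, max(0.0, 0.5 + (g(0) - g(1)) / 2.0))
        L = alpha * L + sq(gamma, omega)
        Lk = [alpha * l + sq(gk, omega) for l, gk in zip(Lk, preds)]
        # simplified update that does not contain Learner's loss
        log_w = [p - ETA * sq(gk, omega) for p, gk in zip(powered, preds)]
        trace.append((L, list(Lk)))
    return trace
```

Under these assumptions, Theorem 1 predicts that at every step Learner's
discounted loss exceeds the best Expert's discounted loss by at most
$(\ln K)/2$, whatever the discounts are.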

2.2 Proof of the Bound 

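Assume that $c$ and $\eta$ are such that condition (3) holds for the game. Let
us show that Algorithm 2 preserves the following condition:

$$ \sum_{k=1}^{K} \frac{1}{K} w_t^k \le 1 . \tag{6} $$

Condition (6) trivially holds for $t = 0$. Assume that (6) holds for $t - 1$,
that is, $\sum_{k=1}^{K} w_{t-1}^k / K \le 1$. Thus, we have

$$ \sum_{k=1}^{K} \frac{1}{K} (w_{t-1}^k)^{\alpha_{t-1}}
   \le \Bigl( \sum_{k=1}^{K} \frac{1}{K} w_{t-1}^k \Bigr)^{\alpha_{t-1}} \le 1 , $$

since the function $x \mapsto x^\alpha$ is concave for $\alpha \in (0, 1]$,
$x \ge 0$, and since $x \le 1$ implies $x^\alpha \le 1$ for $\alpha \ge 0$ and
$x \ge 0$.

Let $\tilde{w}^k$ be any reals such that
$\tilde{w}^k \ge (w_{t-1}^k)^{\alpha_{t-1}} / K$ and
$\sum_{k=1}^{K} \tilde{w}^k = 1$. Due to condition (3) there exists
$\gamma \in \Gamma$ such that for all $\omega \in \Omega$

$$ \lambda(\gamma, \omega)
   \le -\frac{c}{\eta}\ln\Bigl( \sum_{k=1}^{K} \tilde{w}^k e^{-\eta\lambda(\gamma_t^k, \omega)} \Bigr)
   \le -\frac{c}{\eta}\ln\Bigl( \sum_{k=1}^{K} \frac{1}{K} (w_{t-1}^k)^{\alpha_{t-1}} e^{-\eta\lambda(\gamma_t^k, \omega)} \Bigr)
   = g_t(\omega) $$

(the second inequality holds due to our choice of $\tilde{w}^k$). Thus, due to
the property of $\sigma$, we have $\lambda(\gamma_t, \omega) \le g_t(\omega)$
for all $\omega \in \Omega$. In particular, this holds for
$\omega = \omega_t$, and we get

$$ \lambda(\gamma_t, \omega_t) \le -\frac{c}{\eta}
   \ln\Bigl( \sum_{k=1}^{K} \frac{1}{K} (w_{t-1}^k)^{\alpha_{t-1}} e^{-\eta\lambda(\gamma_t^k, \omega_t)} \Bigr) , $$

which is equivalent to (6).

To get the loss bound (4), it remains to note that

$$ \ln w_t^k = \eta\bigl( \mathcal{L}_t / c - \mathcal{L}_t^k \bigr) . $$

Indeed, for $t = 0$, this is trivial. If this holds for $w_{t-1}^k$, then

$$ \begin{aligned}
\ln w_t^k &= \alpha_{t-1}\ln(w_{t-1}^k) + \eta\lambda(\gamma_t, \omega_t)/c - \eta\lambda(\gamma_t^k, \omega_t) \\
 &= \alpha_{t-1}\eta\bigl( \mathcal{L}_{t-1}/c - \mathcal{L}_{t-1}^k \bigr) + \eta\lambda(\gamma_t, \omega_t)/c - \eta\lambda(\gamma_t^k, \omega_t) \\
 &= \eta\bigl( (\alpha_{t-1}\mathcal{L}_{t-1} + \lambda(\gamma_t, \omega_t))/c - (\alpha_{t-1}\mathcal{L}_{t-1}^k + \lambda(\gamma_t^k, \omega_t)) \bigr)
  = \eta\bigl( \mathcal{L}_t / c - \mathcal{L}_t^k \bigr)
\end{aligned} $$

and we get the equality for $w_t^k$. Thus, condition (6) means that

$$ \sum_{k=1}^{K} \frac{1}{K} e^{\eta(\mathcal{L}_t / c - \mathcal{L}_t^k)} \le 1 , \tag{7} $$

and (4) follows by lower-bounding the sum by any of its terms.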

Remark. Everything in this section remains valid if we replace the equal
initial Experts' weights $1/K$ by arbitrary non-negative weights $w^k$,
$\sum_{k=1}^{K} w^k = 1$. This leads to a variant of (4), where the last
additive term is replaced by $\frac{c}{\eta}\ln\frac{1}{w^k}$. Additionally, we
can consider any measurable space $\Theta$ of Experts and a non-negative
weight function $w(\theta)$, and replace sums over $k$ by integrals over
$\Theta$. Then the algorithm and its analysis remain valid (if we impose
natural integrability conditions on Experts' predictions $\gamma_t^\theta$;
see [22] for a more detailed discussion); this will be used in Section 4.

3 Learner's Loss in Bounded Convex Games 

The linear bounds of the form (4) are perfect when $c = 1$. However, for many
games (for example, the absolute loss game), condition (3) does not hold for
$c = 1$ (with any $\eta > 0$), and one cannot get a bound of the form
$\mathcal{L}_T \le \mathcal{L}_T^k + O(1)$. Since Experts' losses
$\mathcal{L}_T^k$ may grow as $T$ in the worst case, any bound with $c > 1$
only guarantees that Learner's loss may exceed an Expert's loss by at most
$O(T)$. However, for a large class of interesting games (including the
absolute loss game), one can obtain guarantees of the form
$\mathcal{L}_T \le \mathcal{L}_T^k + O(\sqrt{T})$ in the undiscounted case. In
this section, we prove an analogous result for the discounted setting.

A game $(\Omega, \Gamma, \lambda)$ is non-empty if $\Omega$ and $\Gamma$ are
non-empty. The game is called bounded if
$L = \max_{\omega, \gamma} \lambda(\gamma, \omega) < \infty$. One may assume
that $L = 1$ (if not, consider the scaled loss function $\lambda / L$). The
game is called convex if for any predictions
$\gamma_1, \dots, \gamma_M \in \Gamma$ and for any weights
$p_1, \dots, p_M \in [0, 1]$, $\sum_{m=1}^{M} p_m = 1$,

$$ \exists \gamma \in \Gamma \;\; \forall \omega \in \Omega \quad
   \lambda(\gamma, \omega) \le \sum_{m=1}^{M} p_m \lambda(\gamma_m, \omega) . \tag{8} $$

Note that if $\Gamma$ is a convex set (e.g., $\Gamma = [0, 1]$) and
$\lambda(\gamma, \omega)$ is convex in $\gamma$ (e.g.,
$\lambda^{\mathrm{abs}}$), then the game is convex.

Theorem 2. Suppose that $(\Omega, \Gamma, \lambda)$ is a non-empty convex
game, and $\lambda(\gamma, \omega) \in [0, 1]$ for all $\gamma \in \Gamma$ and
$\omega \in \Omega$. In the game played according to Protocol 1, Learner has a
strategy guaranteeing that, for any $T$ and for any $k \in \{1, \dots, K\}$,
it holds

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{\ln K}\,\sqrt{\frac{B_T}{\beta_T}} , \tag{9} $$

where $\beta_t = 1/(\alpha_1 \cdots \alpha_{t-1})$ and
$B_T = \sum_{t=1}^{T} \beta_t$.

Note that $B_T / \beta_T$ is the maximal predictor's loss, which is incurred
when the predictor suffers the maximal possible loss $\ell_t = 1$ at each
step.

In the undiscounted case, $\alpha_t = 1$, thus $\beta_t = 1$, $B_T = T$, and
(9) becomes

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{T\ln K} . $$

A similar bound (but with worse constant $\sqrt{2}$ instead of 1 before
$\sqrt{T\ln K}$) is obtained in [2, Theorem 2.3]:

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{2T\ln K} + \sqrt{\frac{\ln K}{8}} . $$

For the exponential discounting $\alpha_t = \alpha$, we have
$\beta_t = \alpha^{-t+1}$ and $B_T = (1 - \alpha^{-T})/(1 - 1/\alpha)$, and
(9) transforms into

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{\ln K}\,\sqrt{\frac{1 - \alpha^T}{1 - \alpha}}
   \le \mathcal{L}_T^k + \sqrt{\frac{\ln K}{1 - \alpha}} . $$

A similar bound (with worse constants) is obtained in [6] for NormalHedge:

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{\frac{8\ln(2.32K)}{1 - \alpha}} . $$

The NormalHedge algorithm has an important advantage: it can guarantee the
last bound without knowledge of the number of experts $K$ (see [3] for a
precise definition). We can achieve the same with the help of a more
complicated algorithm but at the price of a worse bound (Theorem 3).
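
The quantity $B_T / \beta_T$ that drives bound (9) can be computed directly
from the discounts. The following sketch is illustrative only: the helper
names `max_discounted_loss` and `theorem2_regret` are hypothetical, and the
input is assumed to be the list $\alpha_1, \dots, \alpha_{T-1}$ (so that
$T$ equals the list length plus one).

```python
import math


def max_discounted_loss(alphas):
    """B_T / beta_T for discounts alpha_1, ..., alpha_{T-1} (beta_1 = 1)."""
    beta, B = 1.0, 1.0               # beta_1 and B_1
    for alpha in alphas:
        beta /= alpha                # beta_t = beta_{t-1} / alpha_{t-1}
        B += beta                    # B_t = B_{t-1} + beta_t
    return B / beta


def theorem2_regret(alphas, K):
    """Regret term sqrt(ln K * B_T / beta_T) of bound (9)."""
    return math.sqrt(math.log(K) * max_discounted_loss(alphas))
```

With constant discounts the first function reproduces
$(1 - \alpha^T)/(1 - \alpha)$, and with all discounts equal to 1 it returns
$T$, in line with the two special cases discussed above.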

3.1 Learner's Strategy for Theorem 2

The pseudocode of Learner's strategy is given as Algorithm 3. It contains a
constant $a > 0$, which we will choose later in the proof.

The algorithm is not fully specified, since lines 6–7 of Algorithm 3 allow
arbitrary choice of $\gamma$ satisfying the inequality. The algorithm can be
completed with the help of a substitution function $\sigma$ as in Algorithm 2,
so that lines 6–8 are replaced by

$$ g_t(\omega) = -\frac{1}{\eta_t}\ln\Bigl( \sum_{k=1}^{K} \frac{1}{K}
   (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}}
   e^{-\eta_t\lambda(\gamma_t^k, \omega) - \eta_t^2/8} \Bigr) $$

and $\gamma_t = \sigma(g_t)$. However, the current form of Algorithm 3
emphasizes the similarity to Algorithm 5, which is described later
(Subsection 3.3) but actually inspired our analysis.

Algorithm 3 Learner's Strategy for Convex Games
 1: Initialize weights of Experts $w_0^k := 1$, $k = 1, \dots, K$.
    Set $\beta_1 = 1$, $B_0 = 0$.
 2: for $t = 1, 2, \dots$ do
 3:   Get discount $\alpha_{t-1} \in (0, 1]$; update $\beta_t = \beta_{t-1}/\alpha_{t-1}$, $B_t = B_{t-1} + \beta_t$.
 4:   Compute $\eta_t = a\sqrt{\beta_t / B_t}$.
 5:   Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \dots, K$.
 6:   Find $\gamma \in \Gamma$ s.t. for all $\omega \in \Omega$
 7:     $\lambda(\gamma, \omega) \le -\frac{1}{\eta_t}\ln\bigl(\sum_{k=1}^{K} \frac{1}{K}(w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}} e^{-\eta_t\lambda(\gamma_t^k, \omega) - \eta_t^2/8}\bigr)$.
 8:   Output $\gamma_t := \gamma$.
 9:   Get $\omega_t \in \Omega$.
10:   Update the weights $w_t^k := (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}} e^{\eta_t(\lambda(\gamma_t, \omega_t) - \lambda(\gamma_t^k, \omega_t)) - \eta_t^2/8}$,
11:     $k = 1, \dots, K$.
12: end for.

Let us explain the relation of Algorithm 3 to the Weak Aggregating
Algorithm [14] and the exponentially weighted average forecaster with
time-varying learning rate [2, § 2.3]. To this end, consider Algorithm 4.

Algorithm 4 Weak Aggregating Algorithm with Discounting
 1: Initialize Experts' cumulative losses $\mathcal{L}_0^k := 0$, $k = 1, \dots, K$.
    Set $\beta_1 = 1$, $B_0 = 0$.
 2: for $t = 1, 2, \dots$ do
 3:   Get discount $\alpha_{t-1} \in (0, 1]$; update $\beta_t = \beta_{t-1}/\alpha_{t-1}$, $B_t = B_{t-1} + \beta_t$.
 4:   Compute $\eta_t = a\sqrt{\beta_t / B_t}$.
 5:   Compute the weights $q_t^k = e^{-\alpha_{t-1}\eta_t\mathcal{L}_{t-1}^k}$, $k = 1, \dots, K$.
 6:   Compute the normalized weights $w_t^k = q_t^k / \sum_{k=1}^{K} q_t^k$.
 7:   Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \dots, K$.
 8:   Find $\gamma \in \Gamma$ s.t. for all $\omega \in \Omega$  $\lambda(\gamma, \omega) \le \sum_{k=1}^{K} w_t^k \lambda(\gamma_t^k, \omega)$.
 9:   Output $\gamma_t := \gamma$.
10:   Get $\omega_t \in \Omega$.
11:   Update $\mathcal{L}_t^k := \alpha_{t-1}\mathcal{L}_{t-1}^k + \lambda(\gamma_t^k, \omega_t)$, $k = 1, \dots, K$.
12: end for.

The proof of Theorem 2 implies that Algorithm 4 is a special case of
Algorithm 3. Indeed, (15) implies that
$w_{t-1}^k = e^{-\eta_{t-1}\mathcal{L}_{t-1}^k + C}$, where $C$ does not depend
on $k$ and $w_{t-1}^k$ are the weights from Algorithm 3. Therefore
$q_t^k = C'(w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}}$, where $C'$ does not
depend on $k$, and one can take the normalized weights $w_t^k$ of Algorithm 4
for $\tilde{w}^k$ in the proof of Theorem 2. Thus, if Algorithm 4 outputs some
$\gamma_t$ then Algorithm 3 can output this $\gamma_t$ as well.

Recall that if $\alpha_t = 1$ for all $t$ (the undiscounted case),
$\beta_t = 1$ and $B_t = t$, hence $\eta_t = a/\sqrt{t}$. In this case,
Algorithm 4 is just the Weak Aggregating Algorithm as described in [T^ .

Consider now the case when $\Gamma$ is a convex set and
$\lambda(\gamma, \omega)$ is convex in $\gamma$. Then one can take
$\gamma_t = \sum_{k=1}^{K} w_t^k \gamma_t^k$ in Algorithm 4. For
$\alpha_t = 1$, we get exactly the exponentially weighted average forecaster
with time-varying learning rate [2, § 2.3].
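
As a sketch of how Algorithm 4 may look in code, the following Python function
runs it for the absolute loss game with $\Gamma = [0, 1]$, where the weighted
average of the Experts' predictions is an admissible choice of $\gamma_t$.
The function name `waa_discounted`, the list-based interface, and the choice
$a = 2\sqrt{\ln K}$ (the value selected in the proof of Theorem 2) are
assumptions made for this illustration.

```python
import math
import random


def abs_loss(gamma, omega):
    return abs(gamma - omega)


def waa_discounted(discounts, expert_preds, outcomes):
    """Algorithm 4 for the absolute loss; returns (L_t, [L_t^k], bound_t) per step.

    discounts[t-1] is alpha_{t-1}; bound_t = sqrt(ln K * B_t / beta_t) as in (9).
    """
    K = len(expert_preds[0])
    a = 2.0 * math.sqrt(math.log(K))       # choice made in the proof of Theorem 2
    beta, B = 1.0, 0.0
    L, Lk = 0.0, [0.0] * K
    trace = []
    for t, (alpha, preds, omega) in enumerate(
            zip(discounts, expert_preds, outcomes), start=1):
        if t > 1:
            beta /= alpha                  # beta_t = beta_{t-1}/alpha_{t-1}, beta_1 = 1
        B += beta
        eta = a * math.sqrt(beta / B)
        # q_t^k = exp(-alpha_{t-1} eta_t L_{t-1}^k), shifted for stability
        scores = [-alpha * eta * l for l in Lk]
        m = max(scores)
        q = [math.exp(s - m) for s in scores]
        total = sum(q)
        # convex game: the weighted average prediction satisfies line 8
        gamma = sum(qk / total * p for qk, p in zip(q, preds))
        L = alpha * L + abs_loss(gamma, omega)
        Lk = [alpha * l + abs_loss(p, omega) for l, p in zip(Lk, preds)]
        trace.append((L, list(Lk), math.sqrt(math.log(K) * B / beta)))
    return trace
```

On any input sequence, the first component of each trace entry should not
exceed the smallest Expert loss by more than the third component, which is
the guarantee of Theorem 2 specialized to this strategy.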

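3.2 Proof of Theorem 2

Similarly to the case of the AAD, let us show that Algorithm 3 always can find
$\gamma$ in lines 6–7 and preserves the following condition:

$$ \sum_{k=1}^{K} \frac{1}{K} w_t^k \le 1 . \tag{10} $$

First check that $\alpha_{t-1}\eta_t/\eta_{t-1} \le 1$. Indeed,
$\alpha_{t-1} = \beta_{t-1}/\beta_t$, and thus

$$ \alpha_{t-1}\frac{\eta_t}{\eta_{t-1}}
   = \frac{\beta_{t-1}}{\beta_t}\cdot\frac{a\sqrt{\beta_t/B_t}}{a\sqrt{\beta_{t-1}/B_{t-1}}}
   = \sqrt{\frac{\beta_{t-1}B_{t-1}}{\beta_t B_t}} \le 1 . \tag{11} $$

Condition (10) trivially holds for $t = 0$. Assume that (10) holds for
$t - 1$, that is, $\sum_{k=1}^{K} w_{t-1}^k / K \le 1$. Thus, we have

$$ \sum_{k=1}^{K} \frac{1}{K} (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}}
   \le \Bigl( \sum_{k=1}^{K} \frac{1}{K} w_{t-1}^k \Bigr)^{\alpha_{t-1}\eta_t/\eta_{t-1}}
   \le 1 , \tag{12} $$

since the function $x \mapsto x^\alpha$ is concave for $\alpha \in (0, 1]$,
$x \ge 0$, and since $x \le 1$ implies $x^\alpha \le 1$ for $\alpha \ge 0$ and
$x \ge 0$.

Let $\tilde{w}^k$ be any reals such that
$\tilde{w}^k \ge (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}} / K$ and
$\sum_{k=1}^{K} \tilde{w}^k = 1$. (For example,
$\tilde{w}^k = (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}} / \sum_{k=1}^{K} (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}}$.)
By the Hoeffding inequality (see, e.g., [2, Lemma 2.2]), we have

$$ \ln\sum_{k=1}^{K} \tilde{w}^k e^{-\eta_t\lambda(\gamma_t^k, \omega)}
   \le -\eta_t\sum_{k=1}^{K} \tilde{w}^k\lambda(\gamma_t^k, \omega) + \frac{\eta_t^2}{8} , \tag{13} $$

since $\lambda(\gamma, \omega) \in [0, 1]$ for any $\gamma \in \Gamma$ and
$\omega \in \Omega$. Since the game is convex, there exists
$\gamma \in \Gamma$ such that
$\lambda(\gamma, \omega) \le \sum_{k=1}^{K} \tilde{w}^k\lambda(\gamma_t^k, \omega)$
for all $\omega \in \Omega$. For this $\gamma$ and for all
$\omega \in \Omega$ we have

$$ \lambda(\gamma, \omega)
   \le \sum_{k=1}^{K} \tilde{w}^k\lambda(\gamma_t^k, \omega)
   \le -\frac{1}{\eta_t}\ln\Bigl( \sum_{k=1}^{K} \tilde{w}^k e^{-\eta_t\lambda(\gamma_t^k, \omega) - \eta_t^2/8} \Bigr)
   \le -\frac{1}{\eta_t}\ln\Bigl( \sum_{k=1}^{K} \frac{1}{K} (w_{t-1}^k)^{\alpha_{t-1}\eta_t/\eta_{t-1}} e^{-\eta_t\lambda(\gamma_t^k, \omega) - \eta_t^2/8} \Bigr) \tag{14} $$

(the second inequality follows from (13), and the third inequality holds due
to our choice of $\tilde{w}^k$). Thus, one can always find $\gamma$ in
lines 6–7 of Algorithm 3. It remains to note that the inequality in line 7
with $\gamma_t$ substituted for $\gamma$ and $\omega_t$ substituted for
$\omega$ is equivalent to

$$ \sum_{k=1}^{K} \frac{1}{K} w_t^k \le 1 . $$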

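Now let us check that

$$ \ln w_t^k = \eta_t\bigl( \mathcal{L}_t - \mathcal{L}_t^k \bigr)
   - \frac{\eta_t}{8\beta_t}\sum_{\tau=1}^{t} \beta_\tau\eta_\tau . \tag{15} $$

Indeed, for $t = 0$, this is trivial. Assume that it holds for $w_{t-1}^k$.
Then, taking the logarithm of the update expression in line 10 of Algorithm 3
and substituting $\ln w_{t-1}^k$, we get

$$ \begin{aligned}
\ln w_t^k &= \frac{\alpha_{t-1}\eta_t}{\eta_{t-1}}\ln w_{t-1}^k
   + \eta_t\bigl( \lambda(\gamma_t, \omega_t) - \lambda(\gamma_t^k, \omega_t) \bigr) - \frac{\eta_t^2}{8} \\
 &= \frac{\alpha_{t-1}\eta_t}{\eta_{t-1}}\Bigl( \eta_{t-1}\bigl( \mathcal{L}_{t-1} - \mathcal{L}_{t-1}^k \bigr)
   - \frac{\eta_{t-1}}{8\beta_{t-1}}\sum_{\tau=1}^{t-1} \beta_\tau\eta_\tau \Bigr)
   + \eta_t\bigl( \lambda(\gamma_t, \omega_t) - \lambda(\gamma_t^k, \omega_t) \bigr) - \frac{\eta_t^2}{8} \\
 &= \eta_t\bigl( \alpha_{t-1}\mathcal{L}_{t-1} + \lambda(\gamma_t, \omega_t)
   - \alpha_{t-1}\mathcal{L}_{t-1}^k - \lambda(\gamma_t^k, \omega_t) \bigr)
   - \frac{\eta_t\alpha_{t-1}}{8\beta_{t-1}}\sum_{\tau=1}^{t-1} \beta_\tau\eta_\tau - \frac{\eta_t^2}{8} \\
 &= \eta_t\bigl( \mathcal{L}_t - \mathcal{L}_t^k \bigr)
   - \frac{\eta_t}{8\beta_t}\sum_{\tau=1}^{t} \beta_\tau\eta_\tau ,
\end{aligned} $$

where the last step uses $\alpha_{t-1}/\beta_{t-1} = 1/\beta_t$.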



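Condition (10) implies that $w_T^k \le K$ for all $k$ and $T$, hence we get a
loss bound

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \frac{\ln K}{\eta_T}
   + \frac{1}{8\beta_T}\sum_{t=1}^{T} \beta_t\eta_t . \tag{16} $$

Recall that $\eta_t = a\sqrt{\beta_t / B_t}$. To estimate
$\sum_{t=1}^{T} \beta_t\eta_t$, we use the following inequality (see
Appendix A.1 for the proof).

Lemma 1. Let $\beta_t$ be any reals such that
$1 \le \beta_1 \le \beta_2 \le \dots$. Let $B_T = \sum_{t=1}^{T} \beta_t$.
Then, for any $T$, it holds

$$ \frac{1}{\beta_T}\sum_{t=1}^{T} \beta_t\sqrt{\frac{\beta_t}{B_t}}
   \le 2\sqrt{\frac{B_T}{\beta_T}} . $$

Then (16) implies

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \frac{\ln K}{a}\sqrt{\frac{B_T}{\beta_T}}
   + \frac{2a}{8}\sqrt{\frac{B_T}{\beta_T}}
   = \mathcal{L}_T^k + \Bigl( \frac{\ln K}{a} + \frac{a}{4} \Bigr)\sqrt{\frac{B_T}{\beta_T}} . $$

Choosing $a = 2\sqrt{\ln K}$, we finally get

$$ \mathcal{L}_T \le \mathcal{L}_T^k + \sqrt{\ln K}\,\sqrt{\frac{B_T}{\beta_T}} . $$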
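
The inequality of Lemma 1 (in the form
$\frac{1}{\beta_T}\sum_{t \le T} \beta_t\sqrt{\beta_t/B_t} \le 2\sqrt{B_T/\beta_T}$)
can be probed numerically on random non-decreasing sequences. The snippet is a
sanity check under that reading of the lemma, not a proof; the helper name
`lemma1_sides` is hypothetical.

```python
import math
import random


def lemma1_sides(betas):
    """Return (left-hand side, right-hand side) of Lemma 1 for T = len(betas)."""
    B = 0.0
    weighted = 0.0
    for b in betas:
        B += b                              # B_t = B_{t-1} + beta_t
        weighted += b * math.sqrt(b / B)    # beta_t * sqrt(beta_t / B_t)
    beta_T = betas[-1]
    return weighted / beta_T, 2.0 * math.sqrt(B / beta_T)
```

In the undiscounted case $\beta_t = 1$ the two sides become
$\sum_{t \le T} 1/\sqrt{t}$ and $2\sqrt{T}$.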



3.3 A Bound with respect to $\epsilon$-Best Expert

Algorithm 3 originates in the "Fake Defensive Forecasting" (FDF) algorithm
from [5, Theorem 9]. That algorithm is based on the ideas of defensive
forecasting [4], in particular, Hoeffding supermartingales [24], combined with
the ideas from an early version of the Weak Aggregating Algorithm [13].
However, our analysis in Theorem 2 is completely different from [5], following
the lines of [2, Theorem 2.2] and pj..

In this subsection, we consider a direct extension of the FDF algorithm
from [5, Theorem 9] to the discounted case. Algorithm 5 becomes the FDF
algorithm when $\alpha_t = 1$.

Algorithm 5 Fake Defensive Forecasting Algorithm with Discounting
 1: Initialize cumulative losses $\mathcal{L}_0 := 0$, $\mathcal{L}_0^k := 0$, $k = 1, \dots, K$.
    Set $\beta_1 = 1$, $B_0 = 0$.
 2: for $t = 1, 2, \dots$ do
 3:   Get discount $\alpha_{t-1} \in (0, 1]$; update $\beta_t = \beta_{t-1}/\alpha_{t-1}$, $B_t = B_{t-1} + \beta_t$.
 4:   Compute $\eta_t = \sqrt{\beta_t / B_t}$.
 5:   Get Experts' predictions $\gamma_t^k \in \Gamma$, $k = 1, \dots, K$.
 6:   Find $\gamma \in \Gamma$ s.t. for all $\omega \in \Omega$  $f_t(\gamma, \omega) \le C_t$,
      where $f_t$ and $C_t$ are defined by (17) and (18), respectively.
 7:   Output $\gamma_t := \gamma$.
 8:   Get $\omega_t \in \Omega$.
 9:   Update $\mathcal{L}_t := \alpha_{t-1}\mathcal{L}_{t-1} + \lambda(\gamma_t, \omega_t)$.
10:   Update $\mathcal{L}_t^k := \alpha_{t-1}\mathcal{L}_{t-1}^k + \lambda(\gamma_t^k, \omega_t)$, $k = 1, \dots, K$.
11: end for.

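Algorithm 5 in line 6 uses the function

$$ f_t(\gamma, \omega) = \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
   \exp\Bigl( j\alpha_{t-1}\eta_t\bigl( \mathcal{L}_{t-1} - \mathcal{L}_{t-1}^k \bigr)
   - \frac{j^2\eta_t}{2\beta_t}\sum_{\tau=1}^{t-1} \beta_\tau\eta_\tau \Bigr)
   \times \exp\Bigl( j\eta_t\bigl( \lambda(\gamma, \omega) - \lambda(\gamma_t^k, \omega) \bigr)
   - \frac{j^2\eta_t^2}{2} \Bigr) \tag{17} $$

and the constant

$$ C_t = \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
   \exp\Bigl( j\alpha_{t-1}\eta_t\bigl( \mathcal{L}_{t-1} - \mathcal{L}_{t-1}^k \bigr)
   - \frac{j^2\eta_t}{2\beta_t}\sum_{\tau=1}^{t-1} \beta_\tau\eta_\tau \Bigr) , \tag{18} $$

where $1/c = \sum_{j=1}^{\infty} 1/j^2$.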

Algorithm 5 is more complicated than Algorithm 3, and the loss bound we get is
weaker and holds for a narrower class of games. However, this bound can be
stated as a bound for the $\epsilon$-quantile regret introduced in [5]. Namely,
let $\mathcal{L}_t^\epsilon$ be any value such that for at least $\epsilon K$
Experts their loss $\mathcal{L}_t^k$ after step $t$ is not greater than
$\mathcal{L}_t^\epsilon$. The $\epsilon$-quantile regret is the difference
between $\mathcal{L}_t$ and $\mathcal{L}_t^\epsilon$. For $\epsilon = 1/K$, we
can choose $\mathcal{L}_t^\epsilon = \min_k \mathcal{L}_t^k \le \mathcal{L}_t^k$
for all $k = 1, \dots, K$, and thus a bound in terms of the
$\epsilon$-quantile regret implies a bound in terms of $\mathcal{L}_t^k$. The
value $1/\epsilon$ plays the role of the "effective" number of experts.
Algorithm 5 guarantees a bound in terms of $\mathcal{L}_t^\epsilon$ for any
$\epsilon > 0$, without the prior knowledge of $\epsilon$, and in this sense
the algorithm works for the unknown number of Experts (see [5] for a more
detailed discussion).
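
To illustrate the notion, the smallest admissible value of
$\mathcal{L}_t^\epsilon$ is simply an order statistic of the Experts' losses.
The following helper is a hypothetical sketch (the name
`epsilon_quantile_loss` and the restriction to $\epsilon \in (0, 1]$ are
assumptions of the example).

```python
import math


def epsilon_quantile_loss(expert_losses, eps):
    """Smallest value L^eps such that at least eps*K experts have loss <= L^eps."""
    K = len(expert_losses)
    # number of experts that must be covered; the small offset guards against
    # floating-point noise in eps * K (e.g. 0.1 * 30 evaluating slightly above 3)
    m = max(1, math.ceil(eps * K - 1e-12))
    return sorted(expert_losses)[m - 1]
```

With $\epsilon = 1/K$ the helper returns the loss of the best Expert, and with
$\epsilon = 1$ the loss of the worst one.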

For Algorithm 5 we need to restrict the class of games we consider. The game
is called compact if the set
$\Lambda = \{\lambda(\gamma, \cdot) \in \mathbb{R}^\Omega \mid \gamma \in \Gamma\}$
is compact in the standard topology of $\mathbb{R}^\Omega$.

Theorem 3. Suppose that $(\Omega, \Gamma, \lambda)$ is a non-empty convex
compact game, $\Omega$ is finite, and $\lambda(\gamma, \omega) \in [0, 1]$ for
all $\gamma \in \Gamma$ and $\omega \in \Omega$. In the game played according
to Protocol 1, Learner has a strategy guaranteeing that, for any $T$ and for
any $\epsilon > 0$, it holds

$$ \mathcal{L}_T \le \mathcal{L}_T^\epsilon
   + 2\sqrt{\frac{B_T}{\beta_T}\ln\frac{1}{\epsilon}}
   + 7\sqrt{\frac{B_T}{\beta_T}} , \tag{19} $$

where $\beta_t = 1/(\alpha_1 \cdots \alpha_{t-1})$ and
$B_T = \sum_{t=1}^{T} \beta_t$.

Proof. The most difficult part of the proof is to show that one can find
$\gamma$ in line 6 of Algorithm 5. We do not do this here, but refer to [5];
the proof is literally the same as in [5, Theorem 9] and is based on the
supermartingale property of $f_t$. (The rest of the proof below also follows
[5, Theorem 9]; the only difference is in the definition of $f_t$ and $C_t$.)

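Let us check that $C_t \le 1$ for all $t$. Clearly, $C_1 = 1$. Assume that we
have $C_t \le 1$. This implies $f_t(\gamma_t, \omega_t) \le 1$ due to the
choice of $\gamma_t$, and thus
$(f_t(\gamma_t, \omega_t))^{\alpha_t\eta_{t+1}/\eta_t} \le 1$. Similarly to
(11), we have $\alpha_t\eta_{t+1}/\eta_t \le 1$. Since the function
$x \mapsto x^\alpha$ is concave for $\alpha \in (0, 1]$, $x \ge 0$, we get

$$ \begin{aligned}
1 &\ge \bigl( f_t(\gamma_t, \omega_t) \bigr)^{\alpha_t\eta_{t+1}/\eta_t}
  = \Bigl( \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
    \exp\Bigl( j\eta_t\bigl( \mathcal{L}_t - \mathcal{L}_t^k \bigr)
    - \frac{j^2\eta_t}{2\beta_t}\sum_{\tau=1}^{t} \beta_\tau\eta_\tau \Bigr) \Bigr)^{\alpha_t\eta_{t+1}/\eta_t} \\
  &\ge \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
    \exp\Bigl( j\alpha_t\eta_{t+1}\bigl( \mathcal{L}_t - \mathcal{L}_t^k \bigr)
    - \frac{j^2\alpha_t\eta_{t+1}}{2\beta_t}\sum_{\tau=1}^{t} \beta_\tau\eta_\tau \Bigr)
  = C_{t+1} .
\end{aligned} $$

Thus, for each $t$ we have $f_t(\gamma_t, \omega_t) \le 1$, that is,

$$ \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
   \exp\Bigl( j\eta_t\bigl( \mathcal{L}_t - \mathcal{L}_t^k \bigr)
   - \frac{j^2\eta_t}{2\beta_t}\sum_{\tau=1}^{t} \beta_\tau\eta_\tau \Bigr) \le 1 . $$

For any $\epsilon > 0$, let us take any $\mathcal{L}_T^\epsilon$ such that for
at least $\epsilon K$ Experts their losses $\mathcal{L}_T^k$ are smaller than
or equal to $\mathcal{L}_T^\epsilon$. Then we have

$$ 1 \ge \sum_{k=1}^{K} \frac{1}{K}\sum_{j=1}^{\infty} \frac{c}{j^2}
   \exp\Bigl( j\eta_T\bigl( \mathcal{L}_T - \mathcal{L}_T^k \bigr)
   - \frac{j^2\eta_T}{2\beta_T}\sum_{t=1}^{T} \beta_t\eta_t \Bigr)
   \ge \epsilon\,\frac{c}{j^2}
   \exp\Bigl( j\eta_T\bigl( \mathcal{L}_T - \mathcal{L}_T^\epsilon \bigr)
   - \frac{j^2\eta_T}{2\beta_T}\sum_{t=1}^{T} \beta_t\eta_t \Bigr) $$

for any natural $j$. Taking the logarithm and rearranging, we get

$$ \mathcal{L}_T \le \mathcal{L}_T^\epsilon
   + \frac{j}{2\beta_T}\sum_{t=1}^{T} \beta_t\eta_t
   + \frac{1}{j\eta_T}\ln\frac{j^2}{c\epsilon} . $$

Substituting $\eta_t = \sqrt{\beta_t / B_t}$ and using Lemma 1, we get

$$ \mathcal{L}_T \le \mathcal{L}_T^\epsilon
   + \Bigl( j + \frac{2}{j}\ln j + \frac{1}{j}\ln\frac{1}{\epsilon}
   + \frac{1}{j}\ln\frac{1}{c} \Bigr)\sqrt{\frac{B_T}{\beta_T}} . $$

Letting $j = \lceil\sqrt{\ln(1/\epsilon)}\rceil + 1$ and using the estimates
$j \le \sqrt{\ln(1/\epsilon)} + 2$, $(\ln j)/j \le 2$,
$(\ln(1/\epsilon))/j \le \sqrt{\ln(1/\epsilon)}$, $1/j \le 1$, and
$\ln(1/c) = \ln(\pi^2/6) \le 1$, we obtain the final bound. □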



4 Regression with Discounted Loss 

In this section we consider a task of regression, where Learner must predict
"labels" $y_t \in \mathbb{R}$ for input instances
$x_t \in X \subseteq \mathbb{R}^n$. The predictions proceed according to
Protocol 2. This task can be embedded into prediction with expert advice if
Learner competes with all functions $x \to y$ from some large class serving
as a pool of (imaginary) Experts.

Protocol 2 Competitive online regression
  for $t = 1, 2, \dots$ do
    Reality announces $x_t \in X$.
    Learner announces $\gamma_t \in \Gamma$.
    Reality announces $y_t \in \Omega$.
  end for

4.1 The Framework and Linear Functions as Experts

Let the input space be $X \subseteq \mathbb{R}^n$, the set of predictions be
$\Gamma = \mathbb{R}$, and the set of outcomes be $\Omega = [Y_1, Y_2]$. In
this section we consider the square loss
$\lambda^{\mathrm{sq}}(\gamma, y) = (\gamma - y)^2$. Learner competes with a
pool of experts $\Theta = \mathbb{R}^n$ (treated as linear functionals on
$\mathbb{R}^n$). Each individual expert is denoted by $\theta \in \Theta$ and
predicts $\theta' x_t$ at step $t$.

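Let us take any distribution over the experts $P(d\theta)$. It is known from
[19] that (3) holds for the square loss with $c = 1$,
$\eta = \frac{2}{(Y_2 - Y_1)^2}$:

$$ \exists \gamma \in \Gamma \;\; \forall y \in \Omega = [Y_1, Y_2] \quad
   (\gamma - y)^2 \le -\frac{1}{\eta}\ln\int_\Theta e^{-\eta(\theta' x - y)^2} P(d\theta) . \tag{20} $$

Denote by $X$ the matrix of size $T \times n$ consisting of the rows of the
input vectors $x_1', \dots, x_T'$. Let also
$W_T = \mathrm{diag}(\beta_1/\beta_T, \beta_2/\beta_T, \dots, \beta_T/\beta_T)$,
i.e., $W_T$ is a diagonal matrix $T \times T$. In a manner similar to [22], we
prove the following upper bound for Learner's loss.

Theorem 4. For any $a > 0$, there exists a prediction strategy for Learner in
Protocol 2 achieving, for every $T$ and for any linear predictor
$\theta \in \mathbb{R}^n$,

$$ \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\theta' x_t - y_t)^2
   + a\|\theta\|^2 + \frac{(Y_2 - Y_1)^2}{4}\ln\det\Bigl( \frac{X' W_T X}{a} + I \Bigr) . \tag{21} $$

If, in addition, $\|x_t\|_\infty \le Z$ for all $t$, then

$$ \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\theta' x_t - y_t)^2
   + a\|\theta\|^2 + \frac{n(Y_2 - Y_1)^2}{4}
   \ln\Bigl( \frac{Z^2}{a}\,\frac{\sum_{t=1}^{T} \beta_t}{\beta_T} + 1 \Bigr) . \tag{22} $$

In the undiscounted case ($\alpha_t = 1$ for all $t$), the bounds in the
theorem coincide with the bounds for the Aggregating Algorithm for Regression
[22, Theorem 1] with $Y_2 = Y$ and $Y_1 = -Y$, since, as remarked after
Theorem 2, $\beta_t = 1$ and $\bigl(\sum_{t=1}^{T} \beta_t\bigr)/\beta_T = T$
in the undiscounted case. Recall also that in the case of the exponential
discounting ($\alpha_t = \alpha \in (0, 1)$) we have
$\beta_t = \alpha^{-t+1}$ and
$\bigl(\sum_{t=1}^{T} \beta_t\bigr)/\beta_T = (1 - \alpha^T)/(1 - \alpha) \le 1/(1 - \alpha)$.
Thus, for the exponential discounting bound (22) becomes

$$ \sum_{t=1}^{T} \alpha^{T-t}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \alpha^{T-t}(\theta' x_t - y_t)^2
   + a\|\theta\|^2 + \frac{n(Y_2 - Y_1)^2}{4}
   \ln\Bigl( \frac{Z^2(1 - \alpha^T)}{a(1 - \alpha)} + 1 \Bigr) . \tag{23} $$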
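
The passage from the log-determinant term in (21) to the explicit term in (22)
can be checked numerically. The NumPy sketch below is only an illustration
under assumptions: the function name `regret_terms` is hypothetical, and the
comparison relies on the standard fact that the determinant of a positive
definite matrix does not exceed the product of its diagonal entries.

```python
import numpy as np


def regret_terms(X, betas, a, Z):
    """Return (ln det(X' W_T X / a + I), n * ln(Z^2 * sum(beta) / (a * beta_T) + 1)).

    X is the T x n matrix of inputs with |x_{t,i}| <= Z, betas = (beta_1..beta_T).
    """
    X = np.asarray(X, dtype=float)
    betas = np.asarray(betas, dtype=float)
    T, n = X.shape
    W = np.diag(betas / betas[-1])                 # W_T = diag(beta_t / beta_T)
    _, logdet = np.linalg.slogdet(X.T @ W @ X / a + np.eye(n))
    explicit = n * np.log(Z ** 2 * betas.sum() / (a * betas[-1]) + 1.0)
    return logdet, explicit
```

Both values are then multiplied by $(Y_2 - Y_1)^2/4$ in the bounds, so the
first returned number never exceeding the second is what makes (22) follow
from (21).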

4.2 Functions from an RKHS as Experts 

In this section we apply the kernel trick to the linear method to compete with
wider sets of experts. Each expert $f \in \mathcal{F}$ predicts $f(x_t)$. Here
$\mathcal{F}$ is a reproducing kernel Hilbert space (RKHS) with a positive
definite kernel $k : X \times X \to \mathbb{R}$. For the definition of RKHS
and its connection to kernels see [17]. Each kernel defines a unique RKHS. We
use the notation $K_T = \{k(x_i, x_j)\}_{i,j = 1, \dots, T}$ for the kernel
matrix for the input vectors at step $T$. In a manner similar to [7], we prove
the following upper bound on the discounted square loss of Learner.

Theorem 5. For any $a > 0$, there exists a strategy for Learner in Protocol 2
achieving, for every positive integer $T$ and any predictor
$f \in \mathcal{F}$,

$$ \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(f(x_t) - y_t)^2
   + a\|f\|^2 + \frac{(Y_2 - Y_1)^2}{4}
   \ln\det\Bigl( \frac{\sqrt{W_T}\,K_T\sqrt{W_T}}{a} + I \Bigr) . \tag{24} $$

Corollary 1. Assume that
$c_{\mathcal{F}}^2 = \sup_{x \in X} k(x, x) < \infty$ for the RKHS
$\mathcal{F}$. Under the conditions of Theorem 5, given in advance any
constant $\mathbf{T}$ such that
$\bigl(\sum_{t=1}^{T} \beta_t\bigr)/\beta_T \le \mathbf{T}$, one can choose
parameter $a$ such that the strategy in Theorem 5 achieves for any
$f \in \mathcal{F}$

$$ \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(f(x_t) - y_t)^2
   + \Bigl( \frac{(Y_2 - Y_1)^2}{4} + \|f\|^2 \Bigr) c_{\mathcal{F}}\sqrt{\mathbf{T}} , \tag{25} $$

where $c_{\mathcal{F}}^2 = \sup_{x \in X} k(x, x) < \infty$ characterizes the
RKHS $\mathcal{F}$.

Proof. The determinant of a symmetric positive definite matrix is upper
bounded by the product of its diagonal elements (see Chapter 2, Theorem 7 in
[1]), and thus we have

$$ \ln\det\Bigl( I + \frac{\sqrt{W_T}\,K_T\sqrt{W_T}}{a} \Bigr)
   \le T\ln\Bigl( \prod_{t=1}^{T} \Bigl( 1 + \frac{c_{\mathcal{F}}^2\beta_t}{a\beta_T} \Bigr) \Bigr)^{1/T}
   \le T\ln\Bigl( 1 + \frac{c_{\mathcal{F}}^2\sum_{t=1}^{T} \beta_t}{a\beta_T T} \Bigr)
   \le \frac{c_{\mathcal{F}}^2\sum_{t=1}^{T} \beta_t}{a\beta_T}
   \le \frac{c_{\mathcal{F}}^2\mathbf{T}}{a} $$

(we use $\ln(1 + x) \le x$ and the inequality between the geometric and
arithmetic means). Choosing $a = c_{\mathcal{F}}\sqrt{\mathbf{T}}$, we get
bound (25) from (24). □

Recall again that
$\bigl(\sum_{t=1}^{T} \beta_t\bigr)/\beta_T = (1 - \alpha^T)/(1 - \alpha) \le 1/(1 - \alpha)$
in the case of the exponential discounting ($\alpha_t = \alpha \in (0, 1)$),
and we can take $\mathbf{T} = 1/(1 - \alpha)$.

In the undiscounted case ($\alpha_t = 1$), we have
$\bigl(\sum_{t=1}^{T} \beta_t\bigr)/\beta_T = T$, so we need to know the
number of steps in advance. Then, bound (25) matches the bound obtained in
[23, the displayed formula after (33)]. If we do not know an upper bound
$\mathbf{T}$ in advance, it is still possible to achieve a bound similar to
(25) using the Aggregating Algorithm with Discounting to merge Learner's
strategies from Theorem 5 with different values of parameter $a$, in the same
manner as in [23, Theorem 3].

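Corollary 2. Assume that
$c_{\mathcal{F}}^2 = \sup_{x \in X} k(x, x) < \infty$ for the RKHS
$\mathcal{F}$. Under the conditions of Theorem 5, there exists a strategy for
Learner in Protocol 2 achieving, for every positive integer $T$ and any
predictor $f \in \mathcal{F}$,

$$ \begin{aligned}
\sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
 \le{}& \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(f(x_t) - y_t)^2
   + c_{\mathcal{F}}\|f\|(Y_2 - Y_1)\sqrt{\frac{\sum_{t=1}^{T} \beta_t}{\beta_T}} \\
 &+ \frac{(Y_2 - Y_1)^2}{2}\ln\frac{\sum_{t=1}^{T} \beta_t}{\beta_T}
   + \|f\|^2 + (Y_2 - Y_1)^2\ln\Bigl( \frac{c_{\mathcal{F}}(Y_2 - Y_1)}{\|f\|} + 2 \Bigr) .
\end{aligned} \tag{26} $$

Proof. Let us take the strategies from Theorem 5 for $a = 1, 2, 3, \dots$ and
provide them as Experts to the Aggregating Algorithm with Discounting, with
the square loss function, $\eta = 2/(Y_2 - Y_1)^2$ and initial Experts'
weights proportional to $1/a^2$. Then Theorem 1 (extended as described in the
Remark at the end of Section 2) guarantees that the extra loss of the
aggregated strategy (compared to the strategy from Theorem 5 with parameter
$a$) is not greater than $\frac{(Y_2 - Y_1)^2}{2}\ln(c a^2)$, where
$c = \sum_{k=1}^{\infty} 1/k^2$. On the other hand, for the strategy from
Theorem 5 with parameter $a$, similarly to the proof of Corollary 1, we get

$$ \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
   \le \sum_{t=1}^{T} \frac{\beta_t}{\beta_T}(f(x_t) - y_t)^2
   + a\|f\|^2 + \frac{(Y_2 - Y_1)^2}{4}\,\frac{c_{\mathcal{F}}^2}{a}\,
   \frac{\sum_{t=1}^{T} \beta_t}{\beta_T} . $$

Adding $\frac{(Y_2 - Y_1)^2}{2}\ln(c a^2)$ to the right-hand side and choosing

$$ a = \Bigl\lceil \frac{c_{\mathcal{F}}(Y_2 - Y_1)}{2\|f\|}
   \sqrt{\frac{\sum_{t=1}^{T} \beta_t}{\beta_T}} \Bigr\rceil $$

we get the statement after simple estimations. □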

4.3 Proofs of Theorems 4 and 5

Let us begin with several technical lemmas from linear algebra. The proofs of
some of these lemmas are moved to Appendix A.1.

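Lemma 2. Let $A$ be a symmetric positive definite matrix of size $n \times n$. Let
$\theta, b \in \mathbb R^n$, $c$ be a real number, and $Q(\theta) = \theta'A\theta + b'\theta + c$. Then

\[
\int_{\mathbb R^n} e^{-Q(\theta)}\,d\theta = e^{-Q_0}\frac{\pi^{n/2}}{\sqrt{\det A}},
\]

where $Q_0 = \min_{\theta \in \mathbb R^n} Q(\theta)$.

The proof of this lemma can be found in [9, Theorem 15.12.1].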
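For intuition, the one-dimensional case of Lemma 2 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $n = 1$, uses a plain midpoint rule on a truncated interval, and the function names are hypothetical.

```python
import math


def gaussian_integral_numeric(a, b, c, lo=-40.0, hi=40.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-(a*t^2 + b*t + c)) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        t = lo + (i + 0.5) * h
        total += math.exp(-(a * t * t + b * t + c))
    return h * total


def gaussian_integral_lemma2(a, b, c):
    """Closed form of Lemma 2 for n = 1: exp(-Q0) * sqrt(pi / a)."""
    q0 = c - b * b / (4 * a)   # minimum of a*t^2 + b*t + c, attained at t = -b / (2a)
    return math.exp(-q0) * math.sqrt(math.pi / a)


numeric = gaussian_integral_numeric(2.0, -3.0, 0.5)
closed = gaussian_integral_lemma2(2.0, -3.0, 0.5)
print(numeric, closed)
```

The truncation to $[-40, 40]$ is harmless here because the integrand is negligible far from its maximum.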

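Lemma 3. Let $A$ be a symmetric positive definite matrix of size $n \times n$. Let
$b, z \in \mathbb R^n$, and

\[
F(A, b, z) = \min_{\theta \in \mathbb R^n}(\theta'A\theta + b'\theta + z'\theta) - \min_{\theta \in \mathbb R^n}(\theta'A\theta + b'\theta - z'\theta).
\]

Then $F(A, b, z) = -b'A^{-1}z$.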

Lemma 4. Let $A$ be a symmetric positive definite matrix of size $n \times n$. Let
$\theta, b_1, b_2 \in \mathbb R^n$, $c_1, c_2$ be real numbers, and $Q_1(\theta) = \theta'A\theta + b_1'\theta + c_1$, $Q_2(\theta) =
\theta'A\theta + b_2'\theta + c_2$. Then

\[
\frac{\int_{\mathbb R^n} e^{-Q_1(\theta)}\,d\theta}{\int_{\mathbb R^n} e^{-Q_2(\theta)}\,d\theta}
= e^{c_2 - c_1 - \frac{1}{4}(b_2 + b_1)'A^{-1}(b_2 - b_1)}.
\]

The previous three lemmas were implicitly used in [22] to derive a bound on
the cumulative undiscounted square loss of the algorithm competing with linear
experts.

Lemma 5. For any matrix $B$ of size $n \times m$, any matrix $C$ of size $m \times n$, and any
real number $a$ such that the matrices $aI_m + CB$ and $aI_n + BC$ are nonsingular,
it holds that

\[
B(aI_m + CB)^{-1} = (aI_n + BC)^{-1}B, \tag{27}
\]

where $I_n$, $I_m$ are the unit matrices of sizes $n \times n$ and $m \times m$, respectively.

Proof. Note that this is equivalent to $(aI_n + BC)B = B(aI_m + CB)$. □
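Identity (27) can be sanity-checked in the smallest rectangular case. The following sketch assumes $n = 2$ and $m = 1$, so that $B$ is a column and $C$ is a row; the helper name is hypothetical.

```python
def push_through_sides(a, b, c):
    """Both sides of (27) for n = 2, m = 1: B is the column b, C is the row c."""
    # Left side: B (a I_1 + C B)^{-1}; here a I_1 + C B is the scalar a + c.b
    s = a + c[0] * b[0] + c[1] * b[1]
    left = [b[0] / s, b[1] / s]
    # Right side: (a I_2 + B C)^{-1} B with the 2x2 inverse written out explicitly
    m = [[a + b[0] * c[0], b[0] * c[1]],
         [b[1] * c[0], a + b[1] * c[1]]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    right = [(m[1][1] * b[0] - m[0][1] * b[1]) / det,
             (-m[1][0] * b[0] + m[0][0] * b[1]) / det]
    return left, right


left, right = push_through_sides(0.5, [1.0, 2.0], [3.0, -1.0])
print(left, right)
```

The practical value of the identity is that it trades an inverse of size $n \times n$ for one of size $m \times m$, whichever is smaller.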

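Lemma 6. For any matrix $B$ of size $n \times m$, any matrix $C$ of size $m \times n$, and any
real number $a$, it holds that

\[
a^m\det(aI_n + BC) = a^n\det(aI_m + CB), \tag{28}
\]

where $I_n$, $I_m$ are the unit matrices of sizes $n \times n$ and $m \times m$, respectively.
In particular, $\det(I_n + BC) = \det(I_m + CB)$.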
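A small numerical look at the determinant identity in the case $n = 2$, $m = 1$ may be useful. The sketch below is an illustration only, with a hypothetical helper name; it suggests that the two determinants coincide for $a = 1$, which is the case needed in the proof of Theorem 5, and differ by the factor $a^{n-m}$ for other nonzero $a$.

```python
def determinants(a, b, c):
    """Return det(a I_2 + B C) and det(a I_1 + C B) for B = column b, C = row c."""
    det_n = (a + b[0] * c[0]) * (a + b[1] * c[1]) - (b[0] * c[1]) * (b[1] * c[0])
    det_m = a + c[0] * b[0] + c[1] * b[1]
    return det_n, det_m


b, c = [1.0, 2.0], [3.0, -1.0]
for a in (1.0, 0.5, 3.0):
    det_n, det_m = determinants(a, b, c)
    # With n = 2 and m = 1 the factor a^(n - m) is simply a.
    print(a, det_n, a * det_m)
```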

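4.3.1 Proof of Theorem 4

We take the Gaussian initial distribution over the experts with a parameter $a > 0$,

\[
P_0(d\theta) = (a\eta/\pi)^{n/2}e^{-a\eta\|\theta\|^2}\,d\theta,
\]

and use the Aggregating Algorithm with Discounting with infinitely many Experts. Repeating the derivations
from Subsection 2.2, we obtain the following analogue of (7):

\[
(a\eta/\pi)^{n/2}\int_{\mathbb R^n}
e^{\eta\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
 - \eta\bigl(\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\theta'x_t - y_t)^2 + a\|\theta\|^2\bigr)}\,d\theta \le 1.
\]

The simple equality

\[
\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\theta'x_t - y_t)^2 + a\|\theta\|^2
= \theta'(aI + X'W_TX)\theta - 2\sum_{t=1}^T \frac{\beta_t}{\beta_T}y_t\theta'x_t + \sum_{t=1}^T \frac{\beta_t}{\beta_T}y_t^2 \tag{29}
\]

shows that the integral can be evaluated with the help of Lemma 2:

\[
(a\eta/\pi)^{n/2}\int_{\mathbb R^n}
e^{-\eta\bigl(\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\theta'x_t - y_t)^2 + a\|\theta\|^2\bigr)}\,d\theta
= \frac{a^{n/2}\,e^{-\eta\min_{\theta \in \mathbb R^n}\bigl(\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\theta'x_t - y_t)^2 + a\|\theta\|^2\bigr)}}{\sqrt{\det(aI + X'W_TX)}}.
\]

We take the natural logarithms of both parts of the bound and, using the value
$\eta = \frac{2}{(Y_2 - Y_1)^2}$, obtain (21). The determinant of a symmetric positive definite
matrix is upper bounded by the product of its diagonal elements (see Chapter
2, Theorem 7 in [1]):

\[
\det\Bigl(\frac{X'W_TX}{a} + I\Bigr) \le \Bigl(\frac{Z^2\sum_{t=1}^T \beta_t}{a\beta_T} + 1\Bigr)^n,
\]

and thus we obtain (22).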






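4.3.2 Proof of Theorem 5

We must prove that for each $T$ and each sequence $(x_1, y_1, \ldots, x_T, y_T) \in (X \times
\mathbb R)^T$ the guarantee (24) is satisfied. Fix $T$ and $(x_1, y_1, \ldots, x_T, y_T)$. Fix an
isomorphism between the linear span of $k_{x_1}, \ldots, k_{x_T}$, obtained by the Riesz
Representation theorem, and $\mathbb R^{\tilde T}$, where $\tilde T \le T$ is the dimension of the linear span
of $k_{x_1}, \ldots, k_{x_T}$. Let $\tilde x_1, \ldots, \tilde x_T \in \mathbb R^{\tilde T}$ be the images of $k_{x_1}, \ldots, k_{x_T}$, respectively,
under this isomorphism. We have then $k(\cdot, x_i) = \langle \cdot, \tilde x_i \rangle$ for any $x_i$.

We apply the strategy from Theorem 4 to $\tilde x_1, \ldots, \tilde x_T$. The predictions of
the strategies are the same due to Proposition 1 below. Any expert $\theta \in \mathbb R^{\tilde T}$ in
bound (21) can be represented as

\[
\theta = \sum_{i=1}^T c_i\tilde x_i = \sum_{i=1}^T c_i k(\cdot, x_i)
\]

for some $c_i \in \mathbb R$. Thus the experts' predictions are $\theta'\tilde x_t = \sum_{i=1}^T c_i k(x_t, x_i)$, and
the norm is $\|\theta\|^2 = \sum_{i,j=1}^T c_i c_j k(x_i, x_j)$.

Denote by $\tilde X$ the $T \times \tilde T$ matrix consisting of the rows of the vectors $\tilde x_1', \ldots, \tilde x_T'$.
From Lemma 6 we have

\[
\det\Bigl(\frac{\tilde X'W_T\tilde X}{a} + I\Bigr) = \det\Bigl(\frac{\sqrt{W_T}\tilde X\tilde X'\sqrt{W_T}}{a} + I\Bigr).
\]

Thus using $K_T = \tilde X\tilde X'$ we obtain the upper bound

\[
\sum_{t=1}^T \frac{\beta_t}{\beta_T}(\gamma_t - y_t)^2
\le \sum_{t=1}^T \frac{\beta_t}{\beta_T}\Bigl(\sum_{i=1}^T c_i k(x_t, x_i) - y_t\Bigr)^2
+ a\sum_{i,j=1}^T c_i c_j k(x_i, x_j)
+ \frac{(Y_2 - Y_1)^2}{4}\ln\det\Bigl(\frac{\sqrt{W_T}K_T\sqrt{W_T}}{a} + I\Bigr)
\]

for any $c_i \in \mathbb R$, $i = 1, \ldots, T$. By the Representer theorem (see Theorem 4.2 in
[17]) the minimum of $\sum_{t=1}^T \frac{\beta_t}{\beta_T}(f(x_t) - y_t)^2 + a\|f\|^2$ over all $f \in \mathcal F$ is achieved on
one of the linear combinations from the bound obtained above. This concludes
the proof.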

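4.4 Regression Algorithms

In this subsection we derive the explicit form of the prediction strategies for Learner
used in Theorems 4 and 5.

4.4.1 Strategy for Theorem 4

In [22] Vovk suggests for the square loss the following substitution function
satisfying (5):

\[
\gamma_T = \frac{Y_2 + Y_1}{2} - \frac{g_T(Y_2) - g_T(Y_1)}{2(Y_2 - Y_1)}. \tag{30}
\]

It allows us to calculate $g_T$ with unnormalized weights:

\[
g_T(y) = -\frac{1}{\eta}\ln\int_{\mathbb R^n}
e^{-\eta\bigl(\theta'A_T\theta - 2\theta'(b_{T-1} + yx_T) + \bigl(\sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}y_t^2 + y^2\bigr)\bigr)}\,d\theta
\]

for any $y \in \Omega = [Y_1, Y_2]$ (here we use the expansion (29)), where

\[
A_T = aI + \sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}x_tx_t' + x_Tx_T' = aI + X'W_TX,
\]

and $b_{T-1} = \sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}y_tx_t$. The direct calculation of $g_T$ is inefficient: it requires
numerical integration. Instead, we notice that

\[
\begin{aligned}
\gamma_T &= \frac{Y_2 + Y_1}{2} - \frac{g_T(Y_2) - g_T(Y_1)}{2(Y_2 - Y_1)} \\
&= \frac{Y_2 + Y_1}{2} - \frac{1}{2(Y_2 - Y_1)\eta}
\ln\frac{\int_{\mathbb R^n} e^{-\eta\bigl(\theta'A_T\theta - 2\theta'(b_{T-1} + Y_1x_T) + \bigl(\sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}y_t^2 + Y_1^2\bigr)\bigr)}\,d\theta}
{\int_{\mathbb R^n} e^{-\eta\bigl(\theta'A_T\theta - 2\theta'(b_{T-1} + Y_2x_T) + \bigl(\sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}y_t^2 + Y_2^2\bigr)\bigr)}\,d\theta} \\
&= \frac{Y_2 + Y_1}{2} - \frac{1}{2(Y_2 - Y_1)\eta}
\ln e^{\eta\bigl(Y_2^2 - Y_1^2 - (Y_2 - Y_1)\bigl(2b_{T-1} + (Y_2 + Y_1)x_T\bigr)'A_T^{-1}x_T\bigr)} \\
&= \Bigl(b_{T-1} + \frac{Y_2 + Y_1}{2}x_T\Bigr)'A_T^{-1}x_T,
\end{aligned} \tag{31}
\]

where the third equality follows from Lemma 4.

The strategy which predicts according to (31) requires $O(n^3)$ operations per
step. The most time-consuming operation is the inverse of the matrix $A_T$. Note
that for the undiscounted case the inverse could be computed incrementally
using the Sherman-Morrison formula, which leads to $O(n^2)$ operations per step.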
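The prediction rule for linear experts, $\gamma_T = \bigl(b_{T-1} + \frac{Y_2+Y_1}{2}x_T\bigr)'A_T^{-1}x_T$, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the class and function names are hypothetical, the linear system is solved directly at every step (the $O(n^3)$ variant), and it is assumed that the discount factor for the current step is passed to `predict`, so that the discounted sums obey $M_T = \alpha_{T-1}M_{T-1} + x_Tx_T'$.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


class DiscountedRidgeLearner:
    """Hypothetical helper: ridge-like predictions with discounted past data."""

    def __init__(self, n, a, y_min, y_max):
        self.n, self.a = n, a
        self.mid = (y_min + y_max) / 2
        self.M = [[0.0] * n for _ in range(n)]   # sum_t (beta_t / beta_T) x_t x_t'
        self.b = [0.0] * n                       # sum_t (beta_t / beta_T) y_t x_t

    def predict(self, x, alpha_prev):
        """Return gamma_T; alpha_prev is the discount factor announced for this step."""
        n = self.n
        # Discount the past, then add the current signal to obtain A_T - a*I.
        self.M = [[alpha_prev * self.M[i][j] + x[i] * x[j] for j in range(n)]
                  for i in range(n)]
        self.b = [alpha_prev * v for v in self.b]   # b_{T-1}, normalized by beta_T
        A = [[self.M[i][j] + (self.a if i == j else 0.0) for j in range(n)]
             for i in range(n)]
        z = solve(A, x)                              # z = A_T^{-1} x_T
        return sum((self.b[i] + self.mid * x[i]) * z[i] for i in range(n))

    def update(self, x, y):
        """Record the outcome y observed for the signal x of the current step."""
        self.b = [v + y * xi for v, xi in zip(self.b, x)]


learner = DiscountedRidgeLearner(n=2, a=1.0, y_min=-1.0, y_max=1.0)
for x, y in [([1.0, 0.0], 0.5), ([0.5, 1.0], -0.2), ([1.0, 1.0], 0.3)]:
    gamma = learner.predict(x, alpha_prev=0.9)
    learner.update(x, y)
    print(gamma)
```

Each round calls `predict` once and then `update` once. Replacing `solve` by an incremental inverse update would only be straightforward in the undiscounted case, as noted above.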

4.4.2 Strategy for Theorem 5

We use the following notation. Let

$k_T$ be the last column of the matrix $K_T$, $k_T = \{k(x_i, x_T)\}_{i=1}^T$, (32)
$Y_T$ be the column vector of the outcomes, $Y_T = (y_1, \ldots, y_T)'$.

When we write $Z = (V; Y)$ or $Z = (V', Y')'$ we mean that the column vector
$Z$ is obtained by concatenating two column vectors $V$, $Y$ vertically or $V'$, $Y'$
horizontally.

As is clear from the proof of Theorem 5, we need to prove that the strategy
for this theorem is the same as the strategy for Theorem 4 in the case when the
kernel is the scalar product.



Proposition 1. The predictions (31) can be represented as

\[
\gamma_T = \Bigl(Y_{T-1}; \frac{Y_2 + Y_1}{2}\Bigr)'\sqrt{W_T}\bigl(aI + \sqrt{W_T}K_T\sqrt{W_T}\bigr)^{-1}\sqrt{W_T}\,k_T \tag{33}
\]

for the scalar product kernel $k(x, y) = \langle x, y \rangle$, the unit $T \times T$ matrix $I$, and $a > 0$.

Proof. For the scalar product kernel we have $K_T = XX'$ and $\sqrt{W_T}k_T =
\sqrt{W_T}Xx_T$. By Lemma 5 we obtain

\[
\bigl(aI + \sqrt{W_T}XX'\sqrt{W_T}\bigr)^{-1}\sqrt{W_T}Xx_T = \sqrt{W_T}X\bigl(aI + X'W_TX\bigr)^{-1}x_T.
\]

It is easy to see that

\[
\Bigl(Y_{T-1}; \frac{Y_2 + Y_1}{2}\Bigr)'W_TX = \Bigl(\sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}y_tx_t + \frac{Y_2 + Y_1}{2}x_T\Bigr)'
\]

and

\[
X'W_TX = \sum_{t=1}^{T-1}\frac{\beta_t}{\beta_T}x_tx_t' + x_Tx_T'.
\]

Thus we obtain the formula (31) from (33). □
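The equivalence of the primal form (31) and the kernel form (33) can be checked on a toy instance. The sketch below is an illustration under simplifying assumptions: scalar signals, $T = 2$, the scalar product kernel $k(x, z) = xz$, and `w1` standing for $\beta_1/\beta_2$. All function names are hypothetical.

```python
import math


def primal_prediction(a, w1, x1, y1, x2, mid):
    """Formula (31) for scalar signals and T = 2."""
    A = a + w1 * x1 * x1 + x2 * x2   # A_T
    b = w1 * y1 * x1                 # b_{T-1}
    return (b + mid * x2) * x2 / A


def dual_prediction(a, w1, x1, y1, x2, mid):
    """Formula (33) for the same data with the kernel k(x, z) = x * z."""
    s = math.sqrt(w1)                # sqrt(W_T) = diag(s, 1)
    # G = a*I + sqrt(W_T) K_T sqrt(W_T), a symmetric 2x2 matrix
    g11, g12, g22 = a + w1 * x1 * x1, s * x1 * x2, a + x2 * x2
    # u = sqrt(W_T) k_T, where k_T = (k(x1, x2), k(x2, x2))
    u1, u2 = s * x1 * x2, x2 * x2
    det = g11 * g22 - g12 * g12
    z1 = (g22 * u1 - g12 * u2) / det   # z = G^{-1} u
    z2 = (-g12 * u1 + g11 * u2) / det
    # (Y_{T-1}; mid)' sqrt(W_T) z
    return y1 * s * z1 + mid * z2


args = (1.0, 0.5, 1.0, 1.0, 2.0, 0.5)
print(primal_prediction(*args), dual_prediction(*args))
```

The dual form never touches the signals directly, only their pairwise kernel values, which is what allows the same strategy to be run with an arbitrary kernel.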
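A Appendix

A.1 Proofs of Technical Lemmas

Proof of Lemma 1. For $T = 1$ the inequality is trivial. Assume it for $T - 1$.
Then, recalling that $B_T = B_{T-1} + \beta_T$,

\[
\begin{aligned}
\frac{1}{\beta_T}\sum_{t=1}^T \beta_t\sqrt{\frac{\beta_t}{B_t}}
&= \frac{\beta_{T-1}}{\beta_T}\cdot\frac{1}{\beta_{T-1}}\sum_{t=1}^{T-1}\beta_t\sqrt{\frac{\beta_t}{B_t}} + \sqrt{\frac{\beta_T}{B_T}} \\
&\le 2\,\frac{\beta_{T-1}}{\beta_T}\sqrt{\frac{B_{T-1}}{\beta_{T-1}}} + \sqrt{\frac{\beta_T}{B_{T-1} + \beta_T}} \\
&\le 2\sqrt{\frac{B_{T-1}}{\beta_T}} + \sqrt{\frac{\beta_T}{B_{T-1} + \beta_T}} \\
&\le 2\sqrt{\frac{B_{T-1} + \beta_T}{\beta_T}}.
\end{aligned}
\]

The first inequality is by the induction assumption, and the second inequality
holds since $\beta_{T-1} \le \beta_T$. The last inequality is $2\sqrt{x}/\sqrt{y} + \sqrt{y}/\sqrt{x + y} \le
2\sqrt{x + y}/\sqrt{y}$, which holds for any positive $x$ and $y$. (Indeed, it is equivalent to
$2\sqrt{x}\sqrt{x + y} + y \le 2(x + y)$ and $2\sqrt{x}\sqrt{x + y} \le x + y + x$.) □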
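The elementary inequality used in the last step, $2\sqrt{x}/\sqrt{y} + \sqrt{y}/\sqrt{x+y} \le 2\sqrt{x+y}/\sqrt{y}$, is easy to probe numerically. The following sketch evaluates the gap between the two sides on a logarithmic grid; the grid and the function name are arbitrary choices made for illustration.

```python
import math


def gap(x, y):
    """Right side minus left side of the inequality for positive x and y."""
    left = 2 * math.sqrt(x) / math.sqrt(y) + math.sqrt(y) / math.sqrt(x + y)
    right = 2 * math.sqrt(x + y) / math.sqrt(y)
    return right - left


grid = [10.0 ** k for k in range(-3, 4)]
worst = min(gap(x, y) for x in grid for y in grid)
print(worst)
```

The gap shrinks as $y/x \to 0$, which matches the algebraic argument: the inequality reduces to $(\sqrt{x+y} - \sqrt{x})^2 \ge 0$.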

Proof of Lemma 3. This lemma is proven by taking the derivative of the quadratic
forms in $F$ by $\theta$ and calculating the minimum: $\min_{\theta \in \mathbb R^n}(\theta'A\theta + c'\theta) = -\frac{(A^{-1}c)'c}{4}$
for any $c \in \mathbb R^n$ (see Theorem 19.1.1 in [9]). □



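Proof of Lemma 4. After evaluating each of the integrals using Lemma 2, the
ratio is represented as follows:

\[
\frac{\int_{\mathbb R^n} e^{-Q_1(\theta)}\,d\theta}{\int_{\mathbb R^n} e^{-Q_2(\theta)}\,d\theta}
= e^{\min_{\theta \in \mathbb R^n} Q_2(\theta) - \min_{\theta \in \mathbb R^n} Q_1(\theta)}.
\]

The difference of minimums can be calculated using Lemma 3 with $b = \frac{b_2 + b_1}{2}$
and $z = \frac{b_2 - b_1}{2}$:

\[
\min_{\theta \in \mathbb R^n} Q_2(\theta) - \min_{\theta \in \mathbb R^n} Q_1(\theta) = c_2 - c_1 - \frac{1}{4}(b_2 + b_1)'A^{-1}(b_2 - b_1).
\]

□

Proof of Lemma 6. Consider the product of block matrices:

\[
\begin{pmatrix} I_n & B \\ 0 & I_m \end{pmatrix}
\begin{pmatrix} aI_n + BC & 0 \\ -C & aI_m \end{pmatrix}
= \begin{pmatrix} aI_n & aB \\ -C & aI_m \end{pmatrix}
= \begin{pmatrix} aI_n & 0 \\ -C & aI_m + CB \end{pmatrix}
\begin{pmatrix} I_n & B \\ 0 & I_m \end{pmatrix}.
\]

Taking the determinant of both sides, and using formulas for the determinant
of a block matrix, we get the statement of the lemma. □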

A.2 An Alternative Derivation of Regression Algorithms Using Defensive Forecasting

In this section we derive the upper bound and the algorithms using a different
technique, the defensive forecasting [4].

A.2.1 Description of the Proof Technique

We denote the predictions of any expert (from a finite set or following strategies
from Section 4) by $\xi_t^\theta$. For each step $T$ and each expert $\theta$ we define the function

\[
Q_T^\theta : \Gamma \times \Omega \to [0, \infty), \qquad
Q_T^\theta(\gamma, y) := e^{\eta(\lambda(\gamma, y) - \lambda(\xi_T^\theta, y))}.
\]

We also define the mixture function

\[
Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, y, \gamma) :=
\int_\Theta \prod_{t=1}^{T-1}\bigl(Q_t^\theta(\gamma_t, y_t)\bigr)^{\prod_{s=t}^{T-1}\alpha_s} Q_T^\theta(\gamma, y)\,P_0(d\theta)
\]

with some initial weights distribution $P_0(d\theta)$ on the experts. Here $\eta$ is a learning
rate coefficient; it will be defined later in the section. We define the correspondence

\[
\gamma^p = p(Y_2 - Y_1) + Y_1, \quad p \in [0, 1], \tag{35}
\]

between $[0, 1]$ and Learner's predictions $\gamma^p \in \Gamma$.

Let us introduce the notion of a defensive property. We use the notation
$\delta\Omega := \{Y_1, Y_2\}$. Assume that there is a fixed bijection between the space $\mathcal P(\delta\Omega)$
of all probability measures on $\delta\Omega$ and the set $[0, 1]$. Each $p^\pi \in [0, 1]$ corresponds
to some unique $\pi \in \mathcal P(\delta\Omega)$.

Definition 1. A sequence $R$ of functions $R_1, R_2, \ldots$ such that $R_t : \Gamma \times \Omega \to
(-\infty, \infty]$ is said to have the defensive property if, for any $T$ and any $\pi_T \in \mathcal P(\delta\Omega)$,
it holds that

\[
E_{\pi_T} R_T(\gamma^{p^{\pi_T}}, y) \le 1, \tag{36}
\]

where $E_\pi$ is the expectation with respect to a measure $\pi$.

A sequence $R$ is called forecast-continuous if, for all $T$ and all $y \in \delta\Omega$, all the
functions $R_T(\gamma, y)$ are continuous in $\gamma$.

We now prove that $Q_T^\theta$ has the defensive property.

Lemma 7. For $\eta \in \bigl(0, \frac{2}{(Y_2 - Y_1)^2}\bigr]$, $Q_T^\theta$
is a forecast-continuous sequence having the defensive property.

Proof. The continuity is obvious. We need to prove that

\[
p\,e^{\eta((\gamma - Y_2)^2 - (\xi_T^\theta - Y_2)^2)} + (1 - p)\,e^{\eta((\gamma - Y_1)^2 - (\xi_T^\theta - Y_1)^2)} \le 1 \tag{37}
\]

holds for all $\gamma \in [Y_1, Y_2]$ and $\eta \in \bigl(0, \frac{2}{(Y_2 - Y_1)^2}\bigr]$. Indeed, for any $\gamma \in \mathbb R \setminus [Y_1, Y_2]$
there exists $\tilde\gamma \in \{Y_1, Y_2\}$ such that $(\tilde\gamma - y)^2 < (\gamma - y)^2$ for any $y \in \delta\Omega$. Since the
exponent function is increasing, the inequality (37) for any $\gamma \in \mathbb R$ will follow.

We use the correspondence (35), $\xi_T^\theta = q(Y_2 - Y_1) + Y_1$ for some $q \in \mathbb R$, and
$\mu = \eta(Y_2 - Y_1)^2$. Then we have to show that for all $p \in [0, 1]$, $q \in \mathbb R$ and
$\mu \in (0, 2]$

\[
p\,e^{\mu((1 - p)^2 - (1 - q)^2)} + (1 - p)\,e^{\mu(p^2 - q^2)} \le 1.
\]

If we substitute $q = p + x$, the last inequality will reduce to

\[
p\,e^{2\mu(1 - p)x} + (1 - p)\,e^{-2\mu px} \le e^{\mu x^2}, \quad \forall x \in \mathbb R.
\]

Applying Hoeffding's inequality (see [11]) to the random variable $X$ that is equal
to 1 with probability $p$ and to 0 with probability $(1 - p)$, we obtain

\[
p\,e^{h(1 - p)} + (1 - p)\,e^{-hp} \le e^{h^2/8}
\]

for any $h \in \mathbb R$. With the substitution $h := 2\mu x$ it reduces to

\[
p\,e^{2\mu(1 - p)x} + (1 - p)\,e^{-2\mu px} \le e^{\mu^2x^2/2} \le e^{\mu x^2},
\]

where the last inequality holds if $\mu \le 2$. The last inequality is equivalent to
$\eta \le \frac{2}{(Y_2 - Y_1)^2}$, which we assumed. □
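The reduced inequality $p\,e^{\mu((1-p)^2-(1-q)^2)} + (1-p)\,e^{\mu(p^2-q^2)} \le 1$ and the role of the threshold $\mu \le 2$ can be explored numerically. The sketch below scans a grid of $p \in [0,1]$ and $q \in [-2, 3]$; the grid, the range of $q$, and the function names are illustrative assumptions.

```python
import math


def defensive_lhs(p, q, mu):
    """Left side of the reduced inequality from the proof of Lemma 7."""
    return (p * math.exp(mu * ((1 - p) ** 2 - (1 - q) ** 2))
            + (1 - p) * math.exp(mu * (p ** 2 - q ** 2)))


def worst_case(mu, steps=200):
    """Maximum of the left side over a grid of p in [0, 1] and q in [-2, 3]."""
    ps = [i / steps for i in range(steps + 1)]
    qs = [-2 + 5 * j / steps for j in range(steps + 1)]   # experts may leave [0, 1]
    return max(defensive_lhs(p, q, mu) for p in ps for q in qs)


for mu in (1.0, 2.0, 3.0):
    print(mu, worst_case(mu))
```

On such a grid the maximum does not exceed 1 for $\mu \le 2$ and does exceed 1 for $\mu = 3$, in line with the restriction $\eta \le 2/(Y_2 - Y_1)^2$.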

We will further use the maximum value for $\eta$, $\eta = \frac{2}{(Y_2 - Y_1)^2}$.
The following lemma states the property of sequences having the defensive
property that is most important for us; it was originally proven in [15].

Lemma 8. Let $R$ be a forecast-continuous sequence having the defensive property.
For any $T$ there exists $p \in [0, 1]$ such that for all $y \in \delta\Omega$

\[
R_T(\gamma^p, y) \le 1.
\]

Proof. Define a function $f_T : [0, 1] \times \delta\Omega \to (-\infty, \infty]$ by

\[
f_T(p, y) = R_T(\gamma^p, y) - 1.
\]

Since $R$ is forecast-continuous and the correspondence (35) is continuous, $f_T(p, y)$
is continuous in $p$. Since $R$ has the defensive property, we have

\[
p f(p, Y_2) + (1 - p) f(p, Y_1) \le 0 \tag{38}
\]

for all $p \in [0, 1]$. In particular, $f(0, Y_1) \le 0$ and $f(1, Y_2) \le 0$.

Our goal is to show that for some $p \in [0, 1]$ we have $f(p, Y_1) \le 0$ and
$f(p, Y_2) \le 0$. If $f(0, Y_2) \le 0$, we can take $p = 0$. If $f(1, Y_1) \le 0$, we can take
$p = 1$. Assume that $f(0, Y_2) > 0$ and $f(1, Y_1) > 0$. Then the difference

\[
f(p) := f(p, Y_2) - f(p, Y_1)
\]

is positive for $p = 0$ and negative for $p = 1$. By the intermediate value theorem,
$f(p) = 0$ for some $p \in (0, 1)$. By (38) we have $f(p, Y_2) = f(p, Y_1) \le 0$. □

This lemma shows that at each step there is a probability measure (corresponding
to $p \in [0, 1]$) such that the sequence having the defensive property
remains less than one for any outcome.
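The existence argument of Lemma 8 is constructive enough to be turned into a search procedure. The sketch below illustrates it for a single step with finitely many experts and the square loss, under the assumptions that `weights` are normalized current weights, that $\eta = 2/(Y_2 - Y_1)^2$, and that bisection is an acceptable stand-in for the intermediate value theorem. The function names are hypothetical.

```python
import math


def mixture(gamma, y, experts, weights, eta):
    """R(gamma, y) = sum_k w_k * exp(eta * ((gamma - y)^2 - (xi_k - y)^2))."""
    return sum(w * math.exp(eta * ((gamma - y) ** 2 - (xi - y) ** 2))
               for xi, w in zip(experts, weights))


def defensive_prediction(experts, weights, y1, y2, tol=1e-12):
    """Find p as in the proof of Lemma 8 and return gamma^p = p * (y2 - y1) + y1."""
    eta = 2.0 / (y2 - y1) ** 2

    def f(p, y):
        return mixture(p * (y2 - y1) + y1, y, experts, weights, eta) - 1.0

    if f(0.0, y2) <= 0:      # the trivial prediction gamma = y1 already works
        return y1
    if f(1.0, y1) <= 0:      # the trivial prediction gamma = y2 already works
        return y2
    lo, hi = 0.0, 1.0        # f(lo, y2) - f(lo, y1) > 0 > f(hi, y2) - f(hi, y1)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid, y2) - f(mid, y1) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2 * (y2 - y1) + y1


experts, weights = [0.2, 0.9], [0.5, 0.5]
gamma = defensive_prediction(experts, weights, 0.0, 1.0)
print(gamma, mixture(gamma, 0.0, experts, weights, 2.0), mixture(gamma, 1.0, experts, weights, 2.0))
```

Bisection is justified here because, for the square loss, the difference $f(p, Y_2) - f(p, Y_1)$ is decreasing in $p$.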

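The proof of the upper bounds for Defensive Forecasting is based on the
following argument.

Lemma 9. Assume that the sequence of functions $Q_T^\theta$ is forecast-continuous
and has the defensive property. Then the mixtures $Q_T$ as functions of two variables
$y, \gamma$ at the step $T$ form a forecast-continuous sequence having the defensive
property.

Proof. The continuity easily follows from the continuity of $Q_T^\theta$ and the
integration functional. We proceed by induction in $T$. For $T = 0$ we have
$E_\pi Q_0 = E_\pi 1 \le 1$. For $T > 0$ assume that for any $y_1, \ldots, y_{T-2} \in \delta\Omega$ and
any $\gamma_1, \ldots, \gamma_{T-2} \in \Gamma$

\[
E_\pi Q_{T-1}(y_1, \gamma_1, \ldots, y_{T-2}, \gamma_{T-2}, y, \gamma^{p^\pi}) \le 1
\]

for any $\pi \in \mathcal P(\delta\Omega)$. Then by Lemma 8 there exists $\pi_{T-1} \in \mathcal P(\delta\Omega)$ such that

\[
Q_{T-1}(y_1, \gamma_1, \ldots, y_{T-2}, \gamma_{T-2}, y, \gamma^{p^{\pi_{T-1}}})
= \int_\Theta \prod_{t=1}^{T-2}\bigl(Q_t^\theta\bigr)^{\prod_{s=t}^{T-2}\alpha_s} Q_{T-1}^\theta\,P_0(d\theta) \le 1 \tag{39}
\]

for any $y \in \delta\Omega$. We denote $\gamma_{T-1} = \gamma^{p^{\pi_{T-1}}}$ and fix any $y_{T-1} \in \Omega$. We obtain

\[
\begin{aligned}
E_\pi Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, y, \gamma^{p^\pi})
&= E_\pi \int_\Theta \prod_{t=1}^{T-1}\bigl(Q_t^\theta(\gamma_t, y_t)\bigr)^{\prod_{s=t}^{T-1}\alpha_s} Q_T^\theta(\gamma^{p^\pi}, y)\,P_0(d\theta) \\
&= \int_\Theta \prod_{t=1}^{T-1}\bigl(Q_t^\theta(\gamma_t, y_t)\bigr)^{\prod_{s=t}^{T-1}\alpha_s} E_\pi Q_T^\theta(\gamma^{p^\pi}, y)\,P_0(d\theta) \\
&\le \int_\Theta \prod_{t=1}^{T-1}\bigl(Q_t^\theta(\gamma_t, y_t)\bigr)^{\prod_{s=t}^{T-1}\alpha_s}\,P_0(d\theta) \\
&= \int_\Theta \Bigl(\prod_{t=1}^{T-2}\bigl(Q_t^\theta\bigr)^{\prod_{s=t}^{T-2}\alpha_s} Q_{T-1}^\theta\Bigr)^{\alpha_{T-1}} P_0(d\theta) \\
&\le \Bigl(\int_\Theta \prod_{t=1}^{T-2}\bigl(Q_t^\theta\bigr)^{\prod_{s=t}^{T-2}\alpha_s} Q_{T-1}^\theta\,P_0(d\theta)\Bigr)^{\alpha_{T-1}} \le 1.
\end{aligned}
\]

The first inequality holds because $E_\pi Q_T^\theta(\gamma^{p^\pi}, y) \le 1$ for any $\pi \in \mathcal P(\delta\Omega)$. The
penultimate inequality holds due to the concavity of the function $x^\alpha$ with $x > 0$,
$\alpha \in [0, 1]$. The last inequality holds due to (39). This completes the proof. □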

By Lemma 8, at each step $t$ there exists a prediction $\gamma_t$ such that $Q_t$ is less
than one. Now we only need to generalize Lemma 8 for the case when the
outcome set is the full interval: $\Omega = [Y_1, Y_2]$.

Lemma 10. If $\gamma_T$ is such that $Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, y, \gamma_T) \le 1$ for all
$y \in \{Y_1, Y_2\}$, then $Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, y, \gamma_T) \le 1$ for all $y \in [Y_1, Y_2]$.

Proof. Note that any $y \in [Y_1, Y_2]$ can be represented as $y = uY_2 + (1 - u)Y_1$
for some $u \in [0, 1]$. Thus

\[
\begin{aligned}
(\zeta_1 - y)^2 - (\zeta_2 - y)^2 &= \zeta_1^2 - \zeta_2^2 - 2y(\zeta_1 - \zeta_2) \\
&= u[(\zeta_1 - Y_2)^2 - (\zeta_2 - Y_2)^2] + (1 - u)[(\zeta_1 - Y_1)^2 - (\zeta_2 - Y_1)^2]
\end{aligned}
\]

for any $\zeta_1, \zeta_2 \in \mathbb R$. Due to the convexity of the exponent function we have for
any $\eta > 0$

\[
e^{\eta[(\zeta_1 - y)^2 - (\zeta_2 - y)^2]} \le u\,e^{\eta[(\zeta_1 - Y_2)^2 - (\zeta_2 - Y_2)^2]} + (1 - u)\,e^{\eta[(\zeta_1 - Y_1)^2 - (\zeta_2 - Y_1)^2]}.
\]

Thus

\[
Q_T^\theta(\gamma_T, y) \le uQ_T^\theta(\gamma_T, Y_2) + (1 - u)Q_T^\theta(\gamma_T, Y_1)
\]

and therefore

\[
Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, y, \gamma_T) \le uQ_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, Y_2, \gamma_T)
+ (1 - u)Q_T(y_1, \gamma_1, \ldots, y_{T-1}, \gamma_{T-1}, Y_1, \gamma_T) \le 1,
\]

where the second inequality follows from the condition of the lemma. □

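Finally we obtain

\[
\int_\Theta \prod_{t=1}^{T-1} e^{\eta\prod_{s=t}^{T-1}\alpha_s(\lambda(\gamma_t, y_t) - \lambda(\xi_t^\theta, y_t))}\,
e^{\eta(\lambda(\gamma_T, y_T) - \lambda(\xi_T^\theta, y_T))}\,P_0(d\theta) \le 1. \tag{40}
\]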

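A.2.2 Derivation of the Prediction Strategies Using Defensive Forecasting

Lemma 8 describes an explicit strategy of making predictions. This strategy
relies on the search for a fixed point and may become very inefficient, especially
for the cases of an infinite number of experts. Therefore we develop more efficient
strategies for each of our problems.

We first note that the strategy in Lemma 8 solves

\[
\int_\Theta \prod_{t=1}^{T-1} e^{\eta\prod_{s=t}^{T-1}\alpha_s(\lambda(\gamma_t, y_t) - \lambda(\xi_t^\theta, y_t))}\,
e^{\eta(\lambda(\gamma, Y_2) - \lambda(\xi_T^\theta, Y_2))}\,P_0(d\theta)
- \int_\Theta \prod_{t=1}^{T-1} e^{\eta\prod_{s=t}^{T-1}\alpha_s(\lambda(\gamma_t, y_t) - \lambda(\xi_t^\theta, y_t))}\,
e^{\eta(\lambda(\gamma, Y_1) - \lambda(\xi_T^\theta, Y_1))}\,P_0(d\theta) = 0
\]

in $\gamma \in [Y_1, Y_2]$ if the trivial predictions are not satisfactory (the integral becomes
a sum in the case of a finite number of experts). We define

\[
g_T(y) = -\frac{1}{\eta}\ln\int_\Theta \prod_{t=1}^{T-1} e^{-\eta\prod_{s=t}^{T-1}\alpha_s\lambda(\xi_t^\theta, y_t)}\,
e^{-\eta\lambda(\xi_T^\theta, y)}\,P_0(d\theta)
\]

for any $y \in \Omega$. Rewriting the equation for the root we have

\[
e^{\eta(\lambda(\gamma, Y_2) - g_T(Y_2))} - e^{\eta(\lambda(\gamma, Y_1) - g_T(Y_1))} = 0.
\]

Moving the second exponent to the right-hand side and taking log of both sides
we obtain

\[
\lambda(\gamma, Y_2) - g_T(Y_2) = \lambda(\gamma, Y_1) - g_T(Y_1). \tag{42}
\]

For the square loss we can solve (42) in $\gamma$:

\[
\gamma = \frac{Y_2 + Y_1}{2} - \frac{g_T(Y_2) - g_T(Y_1)}{2(Y_2 - Y_1)}. \tag{43}
\]

This formula for predictions is equivalent to (30).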
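The closed-form solution for the square loss can be illustrated with finitely many experts. The sketch below assumes a single step with normalized weights, so that $g(y) = -\frac{1}{\eta}\ln\sum_k w_k e^{-\eta(\xi_k - y)^2}$, and $\eta = 2/(Y_2 - Y_1)^2$. It computes the prediction by the substitution formula and checks the balance condition $\lambda(\gamma, Y_2) - g(Y_2) = \lambda(\gamma, Y_1) - g(Y_1)$. The function names are hypothetical.

```python
import math


def generalized_prediction(y, experts, weights, eta):
    """g(y) = -(1/eta) * ln sum_k w_k * exp(-eta * (xi_k - y)^2)."""
    mix = sum(w * math.exp(-eta * (xi - y) ** 2) for xi, w in zip(experts, weights))
    return -math.log(mix) / eta


def substitution(experts, weights, y1, y2):
    """Prediction (y2 + y1)/2 - (g(y2) - g(y1)) / (2 * (y2 - y1)) for the square loss."""
    eta = 2.0 / (y2 - y1) ** 2
    g1 = generalized_prediction(y1, experts, weights, eta)
    g2 = generalized_prediction(y2, experts, weights, eta)
    return (y2 + y1) / 2 - (g2 - g1) / (2 * (y2 - y1))


experts, weights, y1, y2 = [0.2, 0.9], [0.5, 0.5], 0.0, 1.0
eta = 2.0 / (y2 - y1) ** 2
gamma = substitution(experts, weights, y1, y2)
balance_2 = (gamma - y2) ** 2 - generalized_prediction(y2, experts, weights, eta)
balance_1 = (gamma - y1) ** 2 - generalized_prediction(y1, experts, weights, eta)
print(gamma, balance_1, balance_2)
```

No root search is needed in this form: two evaluations of $g$ replace the fixed-point computation.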